Hi,
This is our initial proposal for explicit fences everywhere and new memory management that doesn't use BO fences. It's a redesign of how Linux graphics drivers work, and it can coexist with what we have now.
*1. Introduction* (skip this if you are already sold on explicit fences)
The current Linux graphics architecture was initially designed for GPUs with only one graphics queue where everything was executed in the submission order and per-BO fences were used for memory management and CPU-GPU synchronization, not GPU-GPU synchronization. Later, multiple queues were added on top, which required the introduction of implicit GPU-GPU synchronization between queues of different processes using per-BO fences. Recently, even parallel execution within one queue was enabled where a command buffer starts draws and compute shaders, but doesn't wait for them, enabling parallelism between back-to-back command buffers. Modesetting also uses per-BO fences for scheduling flips. Our GPU scheduler was created to enable all those use cases, and it's the only reason why the scheduler exists.
The GPU scheduler, implicit synchronization, BO-fence-based memory management, and the tracking of per-BO fences increase CPU overhead and latency, and reduce parallelism. There is a desire to replace all of them with something much simpler. Below is how we could do it.
*2. Explicit synchronization for window systems and modesetting*
The producer is an application and the consumer is a compositor or a modesetting driver.
*2.1. The Present request*
As part of the Present request, the producer will pass 2 fences (sync objects) to the consumer alongside the presented DMABUF BO:
- The submit fence: Initially unsignalled, it will be signalled when the producer has finished drawing into the presented buffer.
- The return fence: Initially unsignalled, it will be signalled when the consumer has finished using the presented buffer.
Deadlock mitigation to recover from segfaults:
- The kernel knows which process is obliged to signal which fence. This information is part of the Present request and supplied by userspace.
- If the producer crashes, the kernel signals the submit fence, so that the consumer can make forward progress.
- If the consumer crashes, the kernel signals the return fence, so that the producer can reclaim the buffer.
- A GPU hang signals all fences. Other deadlocks will be handled like GPU hangs.
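To make the shape of the request concrete, here is a purely hypothetical sketch of what such a Present request could carry; none of these names exist in the kernel today, they only illustrate the two fences plus the signaler information the deadlock mitigation above relies on.

#include <stdint.h>
#include <sys/types.h>

/* Hypothetical sketch only: neither this struct nor a matching ioctl exists.
 * It just illustrates the data the proposal attaches to a Present request. */
struct hypothetical_present_request {
        int      dmabuf_fd;        /* the presented buffer */
        uint32_t submit_syncobj;   /* signalled by the producer when drawing is done */
        uint32_t return_syncobj;   /* signalled by the consumer when it is done reading */
        pid_t    submit_signaler;  /* who must signal the submit fence; kernel signals it if this process dies */
        pid_t    return_signaler;  /* who must signal the return fence; kernel signals it if this process dies */
};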
Other window system requests can follow the same idea.
Merged fences where one fence object contains multiple fences will be supported. A merged fence is signalled only when all of its fences are signalled. The consumer will have the option to redefine the unsignalled return fence to a merged fence.
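For dma_fence-backed fences, merging already exists today for sync_file; a minimal sketch of how a consumer could build such a merged fence from two fence fds (the RFC's merged fence object would behave the same way, whatever the final primitive is):

/* Merge two sync_file fences into one that signals only when both do
 * (linux/sync_file.h, SYNC_IOC_MERGE). */
#include <linux/sync_file.h>
#include <string.h>
#include <sys/ioctl.h>

int merge_sync_files(int fd1, int fd2)
{
        struct sync_merge_data data = {0};

        strncpy(data.name, "merged-return-fence", sizeof(data.name) - 1);
        data.fd2 = fd2;

        if (ioctl(fd1, SYNC_IOC_MERGE, &data) < 0)
                return -1;

        return data.fence; /* new fd, signalled once fd1 AND fd2 have signalled */
}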
*2.2. Modesetting*
Since a modesetting driver can also be the consumer, the present ioctl will contain a submit fence and a return fence too. One small problem with this is that userspace can hang the modesetting driver, but in theory, any later present ioctl can override the previous one, so the unsignalled presentation is never used.
*3. New memory management*
The per-BO fences will be removed and the kernel will not know which buffers are busy. This will reduce CPU overhead and latency. The kernel will not need per-BO fences with explicit synchronization, so we just need to remove their last user: buffer evictions. It also resolves the current OOM deadlock.
*3.1. Evictions*
If the kernel wants to move a buffer, it will have to wait for everything to go idle, halt all userspace command submissions, move the buffer, and resume everything. This is not expected to happen when memory is not exhausted. Other more efficient ways of synchronization are also possible (e.g. sync only one process), but are not discussed here.
*3.2. Per-process VRAM usage quota*
Each process can optionally and periodically query its VRAM usage quota and change domains of its buffers to obey that quota. For example, a process allocated 2 GB of buffers in VRAM, but the kernel decreased the quota to 1 GB. The process can change the domains of the least important buffers to GTT to get the best outcome for itself. If the process doesn't do it, the kernel will choose which buffers to evict at random. (thanks to Christian Koenig for this idea)
*3.3. Buffer destruction without per-BO fences*
When the buffer destroy ioctl is called, an optional fence list can be passed to the kernel to indicate when it's safe to deallocate the buffer. If the fence list is empty, the buffer will be deallocated immediately. Shared buffers will be handled by merging fence lists from all processes that destroy them. Mitigation of malicious behavior:
- If userspace destroys a busy buffer, it will get a GPU page fault.
- If userspace sends fences that never signal, the kernel will have a timeout period and then will proceed to deallocate the buffer anyway.
*3.4. Other notes on MM*
Overcommitment of GPU-accessible memory will cause an allocation failure or invoke the OOM killer. Evictions to GPU-inaccessible memory might not be supported.
Kernel drivers could move to this new memory management today. Only buffer residency and evictions would stop using per-BO fences.
*4. Deprecating implicit synchronization*
It can be phased out by introducing a new generation of hardware where the driver doesn't add support for it (like a driver fork would do), assuming userspace has all the changes for explicit synchronization. This could potentially create an isolated part of the kernel DRM where all drivers only support explicit synchronization.
Marek
Not going to comment on everything on the first pass...
On Mon, Apr 19, 2021 at 5:48 AM Marek Olšák maraeo@gmail.com wrote:
I'm not sure syncobj is what we want. In the Intel world we're trying to go even further to something we're calling "userspace fences" which are a timeline implemented as a single 64-bit value in some CPU-mappable BO. The client writes a higher value into the BO to signal the timeline. The kernel then provides some helpers for waiting on them reliably and without spinning. I don't expect everyone to support these right away, but if we're going to re-plumb userspace for explicit synchronization, I'd like to make sure we take this into account so we only have to do it once.
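As a rough illustration of that model (an assumption-laden sketch, not any driver's actual interface): the timeline is nothing more than a monotonically increasing 64-bit value in memory shared between the parties, and the real design work is in how you wait on it without burning a CPU core.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <sched.h>

/* Signal a timeline point: timelines are monotonic, only move forward. */
void timeline_signal(_Atomic uint64_t *timeline, uint64_t point)
{
        uint64_t cur = atomic_load_explicit(timeline, memory_order_relaxed);

        while (cur < point &&
               !atomic_compare_exchange_weak_explicit(timeline, &cur, point,
                                                      memory_order_release,
                                                      memory_order_relaxed))
                ;
}

bool timeline_reached(_Atomic uint64_t *timeline, uint64_t point)
{
        return atomic_load_explicit(timeline, memory_order_acquire) >= point;
}

void timeline_wait(_Atomic uint64_t *timeline, uint64_t point)
{
        /* A real implementation would block via a kernel helper or GPU-side
         * wait instead of yielding in a loop; this is only to keep it short. */
        while (!timeline_reached(timeline, point))
                sched_yield();
}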
This isn't clear to me. Yes, if we're using anything dma-fence based like syncobj, this is true. But it doesn't seem totally true as a general statement.
What do you mean by "all"? All fences that were supposed to be signaled by the hung context?
Is this even really possible? I'm no kernel MM expert (trying to learn some) but my understanding is that the use of per-BO dma-fence runs deep. I would like to stop using it for implicit synchronization to be sure, but I'm not sure I believe the claim that we can get rid of it entirely. Happy to see someone try, though.
This is going to be difficult. On Intel, we have some resources that have to be pinned to VRAM and can't be dynamically swapped out by the kernel. In GL, we probably can deal with it somewhat dynamically. In Vulkan, we'll be entirely dependent on the application to use the appropriate Vulkan memory budget APIs.
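For reference, the Vulkan side of that is VK_EXT_memory_budget; a minimal sketch of the periodic query an application (or driver) would do, assuming the extension is enabled, with the demotion policy itself left out:

#include <vulkan/vulkan.h>
#include <stdio.h>

void print_heap_budgets(VkPhysicalDevice phys_dev)
{
        VkPhysicalDeviceMemoryBudgetPropertiesEXT budget = {
                .sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_MEMORY_BUDGET_PROPERTIES_EXT,
        };
        VkPhysicalDeviceMemoryProperties2 props = {
                .sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_MEMORY_PROPERTIES_2,
                .pNext = &budget,
        };

        vkGetPhysicalDeviceMemoryProperties2(phys_dev, &props);

        for (uint32_t i = 0; i < props.memoryProperties.memoryHeapCount; i++)
                printf("heap %u: usage %llu / budget %llu bytes\n", i,
                       (unsigned long long)budget.heapUsage[i],
                       (unsigned long long)budget.heapBudget[i]);
}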
--Jason
We already don't have accurate BO fences in some cases. Instead, BOs can have fences which are equal to the last seen command buffer for each queue. It's practically the same as if the kernel had no visibility into command submissions and just added a fence into all queues when it needed to wait for idle. That's already one alternative to BO fences that would work today. The only BOs that need accurate BO fences are shared buffers, and those use cases can be converted to explicit fences.
Removing memory management from all command buffer submission logic would be one of the benefits that is quite appealing.
You don't need to depend on apps for budgeting and placement determination. You can sort buffers according to driver usage, e.g. scratch/spill buffers, shader IO rings, MSAA images, other images, and buffers. Alternatively, you can have just internal buffers vs app buffers. Then you assign VRAM from left to right until you reach the quota. This is optional, so this part can be ignored.
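A trivial sketch of that greedy assignment (the categories and the quota source are illustrative assumptions, not an existing driver interface):

#include <stddef.h>
#include <stdint.h>

enum domain { DOMAIN_VRAM, DOMAIN_GTT };

struct buffer {
        uint64_t size;
        int importance;   /* e.g. scratch/spill = 0, shader IO rings = 1, MSAA = 2, images = 3, other = 4 */
        enum domain domain;
};

/* Assumes 'buffers' is already sorted with the most important first.
 * Assign VRAM left to right until the kernel-provided quota is reached;
 * everything past the cutoff goes to GTT. */
void assign_domains(struct buffer *buffers, size_t count, uint64_t vram_quota)
{
        uint64_t used = 0;

        for (size_t i = 0; i < count; i++) {
                if (used + buffers[i].size <= vram_quota) {
                        buffers[i].domain = DOMAIN_VRAM;
                        used += buffers[i].size;
                } else {
                        buffers[i].domain = DOMAIN_GTT;
                }
        }
}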
GPU hangs.
What do you mean by "all"? All fences that were supposed to be signaled by the hung context?
Yes, that's one of the possibilities. Any GPU hang followed by a GPU reset can clear VRAM, so all processes should recreate their contexts and reinitialize resources. A deadlock caused by userspace could be handled similarly.
I don't know how timeline fences would work across processes and how resilient they would be to segfaults.
Marek
On Mon, Apr 19, 2021 at 11:48 AM Jason Ekstrand jason@jlekstrand.net wrote:
Am 19.04.21 um 17:48 schrieb Jason Ekstrand:
Well that is exactly what our Windows guys have suggested as well, but it strongly looks like this isn't sufficient.
First of all you run into security problems when any application can just write any value to that memory location. Just imagine an application sets the counter to zero and X waits forever for some rendering to finish.
In addition to that, in such a model you can't determine which queue is guilty in case of a hang, and you can't reset the synchronization primitives in case of an error.
Apart from that this is rather inefficient, e.g. we don't have any way to prevent priority inversion when used as a synchronization mechanism between different GPU queues.
Christian.
On Tue, Apr 20, 2021 at 12:15 PM Christian König ckoenig.leichtzumerken@gmail.com wrote:
The thing is, with userspace fences the security boundary issue moves into userspace entirely. And it really doesn't matter whether the event you're waiting on doesn't complete because the other app crashed or was stupid or intentionally gave you a wrong fence point: You have to somehow handle that, e.g. perhaps with conditional rendering and just using the old frame in compositing if the new one doesn't show up in time. Or something like that. So trying to get the kernel involved but also not so much involved sounds like a bad design to me.
Yeah but you can't have it both ways. Either all the scheduling in the kernel and fence handling is a problem, or you actually want to schedule in the kernel. hw seems to definitely move towards the more stupid spinlock-in-hw model (and direct submit from userspace and all that), priority inversions be damned. I'm really not sure we should fight that - if it's really that inefficient then maybe hw will add support for waiting sync constructs in hardware, or at least be smarter about scheduling other stuff. E.g. on intel hw both the kernel scheduler and fw scheduler knows when you're spinning on a hw fence (whether userspace or kernel doesn't matter) and plugs in something else. Add in a bit of hw support to watch cachelines, and you have something which can handle both directions efficiently.
Imo given where hw is going, we shouldn't try to be too clever here. The only thing we do need to provision is being able to do cpu side waits without spinning. And that should probably be done in a fairly gpu specific way still. -Daniel
Daniel, are you suggesting that we should skip any deadlock prevention in the kernel, and just let userspace wait for and signal any fence it has access to?
Do you have any concern with the deprecation/removal of BO fences in the kernel assuming userspace is only using explicit fences? Any concern with the submit and return fences for modesetting and other producer<->consumer scenarios?
Thanks, Marek
On Tue, Apr 20, 2021 at 6:34 AM Daniel Vetter daniel@ffwll.ch wrote:
On Tue, Apr 20, 2021 at 07:03:19AM -0400, Marek Olšák wrote:
Yeah. If we go with userspace fences, then userspace can hang itself. Not the kernel's problem. The only criteria is that the kernel itself must never rely on these userspace fences, except for stuff like implementing optimized cpu waits. And in those we must always guarantee that the userspace process remains interruptible.
It's a completely different world from dma_fence based kernel fences, whether those are implicit or explicit.
Let me work on the full reply for your rfc first, because there's a lot of details here and nuance. -Daniel
Yeah. If we go with userspace fences, then userspace can hang itself. Not the kernel's problem.
Well, the path of inner peace begins with four words. “Not my fucking problem.”
But I'm not that much concerned about the kernel, but rather about important userspace processes like X, Wayland, SurfaceFlinger etc...
I mean attaching a page to a sync object and allowing waits/signals from both the CPU and the GPU side is not so much of a problem.
Nice idea, but how would you handle that on the OpenGL/Glamor/Vulkan level?
Regards, Christian.
Am 20.04.21 um 13:16 schrieb Daniel Vetter:
On Tue, Apr 20, 2021 at 1:59 PM Christian König ckoenig.leichtzumerken@gmail.com wrote:
For opengl we do all the same guarantees, so if you get one of these you just block until the fence is signalled. Doing that properly means a submit thread to support drm_syncobj, like for vulkan.
For vulkan we probably want to represent these as proper vk timeline objects, and the vulkan way is to just let the application (well compositor) here deal with it. If they import timelines from untrusted other parties, they need to handle the potential fallback of being lied to. How they do that is "not vulkan's fucking problem", because that entire "with great power (well performance) comes great responsibility" is the entire vk design paradigm.
Glamor will just rely on GL providing a nice package over the harsh reality of gpus, like usual.
So I guess step 1 here for GL would be to provide some kind of import/export of timeline syncobj, including properly handling this "future/indefinite fences" aspect of them with submit thread and everything. -Daniel
On Tue, Apr 20, 2021 at 9:10 AM Daniel Vetter daniel@ffwll.ch wrote:
The security aspects are currently an unsolved problem in Vulkan. The assumption is that everyone trusts everyone else to be careful with the scissors. It's a great model!
I think we can do something in Vulkan to allow apps to protect themselves a bit but it's tricky and non-obvious.
--Jason
Sorry for the mega-reply but timezones...
On Tue, Apr 20, 2021 at 6:59 AM Christian König ckoenig.leichtzumerken@gmail.com wrote:
🧘
Yup... Sorting out these issues is what makes this a hard problem.
"Just handle it with conditional rendering" is a pretty trite answer. If we have memory fences, we could expose a Vulkan extension to allow them to be read by conditional rendering or by a shader. However, as Daniel has pointed out multiple times, composition pipelines are long and complex and cheap tricks like that aren't something we can rely on for solving the problem. If we're going to solve the problem, we need to make driver-internal stuff nice while still providing something that looks very much like a sync_file with finite time semantics to the composition pipeline. How? That's the question.
Sorry that my initial reply was so terse. I'm not claiming (nor will I ever) that memory fences are an easy solution. They're certainly fraught with potential issues. I do, however, think that they are the basis of the future of synchronization. The fact that Windows 10 and the consoles have been using them to great effect for 5+ years indicates that they do, in fact, work. However, Microsoft has never supported both a memory fence and a dma-fence-like model in the same Windows version, so there's no prior art for smashing the two models together.
For this, Windows has two solutions. One is that everyone is hang-aware in some sense. It means extra complexity in the window system but some amount of that is necessary if you don't have easy error propagation.
Second is that the fences aren't actually signaled from user space as Daniel suggests but are signaled from the kernel. This means that the kernel is aware of all the fences which are supposed to be signaled from a given context/engine. When a hang occurs, it has a mode where it smashes all timelines which the context is supposed to be signaling to UINT64_MAX, unblocking anything which depends on them. There are a lot of details here which are unclear to me such as what happens if some other operation smashes it back to a lower value. Does it keep smashing to UINT64_MAX until it all clears? I'm not sure.
Yup. The synchronization model that Windows, consoles, and hardware are moving towards is a model where you have memory fences for execution synchronization and explicit memory binding and residency management, and then the ask from userspace to the kernel is "put everything in place and then run as fast as you can". I think that last bit is roughly what Marek is asking for here. The difference is in the details on how it all works internally.
Also, just to be clear, my comments here weren't so much "please solve all the problems" as asking that, as we improve explicit synchronization plumbing, we do it with memory fences in mind. I do think they're the future, even if it's a difficult future to get to, and I'm trying to find a path.
--Jason
On Mon, Apr 19, 2021 at 06:47:48AM -0400, Marek Olšák wrote:
I get the feeling you're mixing up a lot of things here that have more nuance, so first some lingo.
- There's kernel based synchronization, based on dma_fence. These come in two major variants: Implicit synchronization, where the kernel attaches the dma_fences to a dma-buf, and explicit synchronization, where the dma_fence gets passed around as a stand-alone object, either a sync_file or a drm_syncobj
- Then there's userspace fence synchronization, where userspace issues any fences directly and the kernel doesn't even know what's going on. This is the only model that allows you to ditch the kernel overhead, and it's also the model that vk uses.
I concur with Jason that this one is the future, it's the model hw wants, compute wants and vk wants. Building an explicit fence world which doesn't aim at this is imo wasted effort.
Now you smash them into one thing by also changing the memory model, but I think that doesn't work:
- Relying on gpu page faults across the board won't happen. I think right now only amd's GFX10 or so has enough pagefault support to allow this, and even there I'm not really sure. Nothing else will anytime soon, at least not as far as I know. So we need to support slightly more hw in upstream than just that. Any plan that's realistic needs to cope with dma_fence for a really long time.
- Pown^WPin All The Things! is probably not a general enough memory management approach. We've kinda tried for years to move away from it. Sure we can support it as an optimization in specific workloads, and it will make stuff faster, but it's not going to be the default I think.
- We live in a post xf86-video-$vendor world, and all these other compositors rely on implicit sync. You're not going to be able to get rid of them anytime soon. What's worse, all the various EGL/vk buffer sharing things also rely on implicit sync, so you get to fix up tons of applications on top. Any plan that's realistic needs to cope with implicit and explicit sync at the same time; an explicit-only cutover won't work.
- Absolutely infuriating, but you can't use page-faulting together with any dma_fence synchronization primitives, whether implicit or explicit. This means until the entire ecosystem has moved forward (good luck with that) we have to support dma_fence. The only sync model that works together with page faults is userspace fence based sync.
Then there's the somewhat aside topic of how amdgpu/radeonsi does implicit sync, at least last I checked. Currently this oversynchronizes badly because it's left to the kernel to guess what should be synchronized, and that gets things wrong. What you need there is explicit implicit synchronization:
- on the cs side, userspace must set explicitly for which buffers the kernel should engage in implicit synchronization. That's how it works on all other drivers that support more explicit userspace like vk or gl drivers that are internally all explicit. So essentially you only set the implicit fence slot when you really want to, and only userspace knows this. Implementing this without breaking the current logic probably needs some flags.
- the other side isn't there yet upstream, but Jason has patches. Essentially you also need to sample your implicit sync points at the right spot, to avoid oversync on later rendering by the producer. Jason's patch solves this by adding an ioctl to dma-buf to get the current set (see the sketch after this list).
- without any of this, for pure explicit fencing userspace the kernel will simply maintain a list of all current users of a buffer. For memory management, which means eviction handling roughly works like you describe below, we wait for everything before a buffer can be moved.
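For reference, the dma-buf ioctl mentioned above looks roughly like this; it eventually landed upstream as DMA_BUF_IOCTL_EXPORT_SYNC_FILE, so treat the sketch as illustrative of the interface rather than of the exact patches being discussed:

#include <linux/dma-buf.h>
#include <sys/ioctl.h>

/* Sample the current implicit-sync fences on a dma-buf as a sync_file fd.
 * DMA_BUF_SYNC_READ returns a fence that waits for all current writers,
 * which is what a consumer needs before sampling the buffer. */
int export_read_fence(int dmabuf_fd)
{
        struct dma_buf_export_sync_file args = {
                .flags = DMA_BUF_SYNC_READ,
                .fd = -1,
        };

        if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &args) < 0)
                return -1;

        return args.fd; /* sync_file that signals when the current writers finish */
}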
This should get rid of the oversync issues, and since implicit sync is baked in everywhere right now, you'll have to deal with implicit sync for a very long time.
Next up is reducing the memory manager overhead of all this, without changing the ecosystem.
- hw option would be page faults, but until we have full explicit userspace sync we can't use those. Which currently means compute only. Note that for vulkan or maybe also gl this is quite nasty for userspace, since as soon as you need to switch to dma_fence sync or implicit sync (winsys buffer, or buffer sharing with any of the current set of extensions) you have to flip all the sync points in your internal driver state over from userspace fencing to dma_fence kernel fencing. Can still be all explicit using drm_syncobj ofc.
- next up, if your hw has preemption you could use that, except preemption takes a while longer, so from a memory pov it really should be done with dma_fence. Plus it has all the same problems in that it requires userspace fences.
- now for making dma_fence O(1) in the fastpath you need the shared dma_resv trick and the lru bulk move. radv/amdvlk use that, but I think radeonsi doesn't yet. But maybe I missed that. Either way we need to do some better kernel work so it can also be fast for shared buffers, if those become a problem. On the GL side doing this will use a lot of the tricks for residency/working set management you describe below, except the kernel can still throw out an entire gpu job. This is essentially what you describe with 3.1. Vulkan/compute already work like this.
Now this gets the performance up, but it doesn't give us any road towards using page faults (outside of compute) and so retiring dma_fence for good. For that we need a few pieces:
- Full new set of userspace winsys protocols and egl/vk extensions. Pray it actually gets adopted, because neither AMD nor Intel have the engineers to push these kind of ecosystems/middleware issues forward on their payrolls. Good pick is probably using drm_syncobj as the kernel primitive for this. Still uses dma_fence underneath.
- Some clever kernel tricks so that we can substitute dma_fence for userspace fences within a drm_syncobj. drm_syncobj already has the notion of waiting for a dma_fence to materialize. We can abuse that to create an upgrade path from dma_fence based sync to userspace fence syncing. Ofc none of this will be on the table if userspace hasn't adopted explicit sync.
With these two things I think we can have a reasonable upgrade path. None of this will be break-the-world type changes though.
Bunch of comments below.
Build this with syncobj timelines and it makes a lot more sense I think. We'll need that for having a proper upgrade path, both on the hw/driver side (being able to support stuff like preempt or gpu page faults) and the ecosystem side (so that we don't have to rev protocols twice, once going to explicit dma_fence sync and once more for userspace sync).
So for kernel based sync imo simplest is to just reuse dma_fence, same rules apply.
For userspace fencing the kernel simply doesn't care how stupid userspace is. Security checks at boundaries (e.g. client vs compositor) are also userspace's problem and can be handled by e.g. timeouts + conditional rendering on the compositor side. The timeout might be in the compat glue, e.g. when we stall for a dma_fence to materialize from a drm_syncobj. I think in vulkan this is de facto already up to applications to deal with entirely if they deal with untrusted fences.
Nope, we can't just shrug off all deadlocks with "gpu reset rolls in". For one, with userspace fencing the kernel isn't aware of any deadlocks, you fundamentally can't tell "has deadlocked" from "is still doing useful computations" because that amounts to solving the halting problem.
Any programming model we come up with where both kernel and userspace are involved needs to come up with rules where at least non-evil userspace never deadlocks. And if you just allow both then it's pretty easy to come up with scenarios where both userspace and kernel alone are deadlock free, but interactions result in hangs. That's why we've recently documented all the corner cases around indefinite dma_fences, and also why you currently can't use gpu page faults with anything that uses dma_fence for sync.
That's why I think with userspace fencing the kernel simply should not be involved at all, aside from providing optimized/blocking cpu wait functionality.
What's "the current OOM deadlock"?
10-20 years I'd say before that's even an option. -Daniel
Marek
Hi Daniel,
Am 20.04.21 um 14:01 schrieb Daniel Vetter:
It's even worse. GFX9 has enough support so that in theory it can work.
Because of this Felix and his team are working on HMM support based on this generation.
On GFX10 some aspects of it are improved while others are totally broken again.
How about this: 1. We extend drm_syncobj to be able to contain both classic dma_fence as well as being used for user fence synchronization.
We already discussed that briefly and I think we should have a rough plan for this in our heads.
2. We allow attaching a drm_syncobj to dma_resv for implicit sync.
This requires that both the consumer as well as the producer side will support user fence synchronization.
We would still have quite a bunch of limitations, especially we would need to adjust all the kernel consumers of classic dma_resv objects. But I think it should be doable.
Regards, Christian.
Hi,
On Tue, 20 Apr 2021 at 13:01, Daniel Vetter daniel@ffwll.ch wrote:
This should get rid of the oversync issues, and since implicit sync is baked in everywhere right now, you'll have to deal with implicit sync for a very long time.
Depends what you mean by 'implicit sync'. ;)
Getting userspace (Vulkan WSI, EGL, Wayland compositors, browsers, media clients) over to explicit sync is easy, _provided_ that the explicit sync gives us the same guarantees as implicit sync, i.e. completes in bounded time, GPU/display work can be flushed to the kernel predicated on fence completion with the kernel handling synchronisation and scheduling. It's just a matter of typing, and until now we haven't had a great reason to do that typing. Now we do have that reason, so we are implementing it. Whether it's dma_fence or drm_syncobj is mostly immaterial; we can encode in protocol requirements that you can't try to use wait-before-signal with drm_syncobj and you'll get killed if you try.
Getting that userspace over to fully userspace-based sync (wait-before-signal or wait-never-signal, no kernel assistance but you just have to roll your own polling or signal handling on either CPU or GPU side) is not easy. It might never happen, because it's an extraordinary amount of work, introduces a huge amount of fragility into a super-critical path, and so far it's not clear that it's a global performance improvement for the whole system, just shifting performance problems from kernel to userspace, and probably (AFAICT) making them worse in addition to the other problems it brings.
What am I missing?
Cheers, Daniel
On Tue, Apr 20, 2021 at 3:04 PM Daniel Stone daniel@fooishbar.org wrote:
Nothing I think.
Which is why I'm arguing that kernel based sync with all the current dma_fence guarantees is probably going to stick around for something close to forever, and we need to assume so.
Only in specific cases does full userspace sync make sense imo: - anything compute, excluding using compute/shaders to create displayable buffers, but compute as in your final target is writing some stuff to files and never interacting with any winsys. Those really care because "run a compute kernel for a few hours" isn't supported without userspace sync, and I don't think ever will. - maybe vulkan direct display, once/if we have the extensions for atomic kms wired up - maybe someone wants to write a vulkan based compositor and deal with all this themselves. That model I think would also imply that they deal with all the timeouts and fallbacks, irrespective of whether underneath we actually run on dma_fence timeline syncobjs or userspace fence timeline syncobjs.
From about 2 years of screaming at this stuff it feels like this will be a pretty exhaustive list for the next 10 years. Definitely doesn't include your random linux desktop wayland compositor stack. But there are definitely some specific areas where people care enough for all the pain. For everyone else it's all the other pieces I laid out.
This also means that I don't think we now have that impetus to start typing all the explicit sync protocol/compositor bits, since:
- the main driver is compute stuff, that needs mesa work (well vk/ocl plus all the various repainted copies of cuda)
- with the tricks to make implicit sync work more like explicit sync the oversyncing can be largely solved without protocol work
-Daniel
Hi Marek,
On Mon, 19 Apr 2021 at 11:48, Marek Olšák maraeo@gmail.com wrote:
So the 'present request' is an ioctl, right? Not a userspace construct like it is today? If so, how do we correlate the two?
The terminology is pretty X11-centric so I'll assume that's what you've designed against, but Wayland and even X11 carry much more auxiliary information attached to a present request than just 'this buffer, this swapchain'. Wayland latches a lot of data on presentation, including non-graphics data such as surface geometry (so we can have resizes which don't suck), window state (e.g. fullscreen or not, also so we can have resizes which don't suck), and these requests can also cascade through a tree of subsurfaces (so we can have embeds which don't suck). X11 mostly just carries timestamps, which is more tractable.
Given we don't want to move the entirety of Wayland into kernel-visible objects, how do we synchronise the two streams so they aren't incoherent? Taking a rough stab at it whilst assuming we do have DRM_IOCTL_NONMODE_PRESENT, this would create a present object somewhere in kernel space, which the producer would create and ?? export a FD from, that the compositor would ?? import.
As part of the Present request, the producer will pass 2 fences (sync objects) to the consumer alongside the presented DMABUF BO:
We already have this in Wayland through dma_fence. I'm relaxed about this becoming drm_syncobj or drm_newmappedsyncobjthing, it's just a matter of typing. X11 has patches to DRI3 to support dma_fence, but they never got merged because it was far too invasive to a server which is no longer maintained.
Currently in Wayland the return fence (again a dma_fence) is generated by the compositor and sent as an event when it's done, because we can't have speculative/empty/future fences. drm_syncobj would make this possible, but so far I've been hesitant because I don't see the benefit to it (more below).
Same as today with dma_fence. Less true with drm_syncobj if we're using timelines.
This is only a change if the producer is now allowed to submit a fence before it's flushed the work which would eventually fulfill that fence. Using dma_fence has so far isolated us from this.
'The consumer' is problematic, per below. I think the wording you want is 'if no references are held to the submitted present object'.
Which other window system requests did you have in mind? Again, moving the entirety of Wayland's signaling into the kernel is a total non-starter. Partly because it means our entire protocol would be subject to the kernel's ABI rules, partly because the rules and interdependencies between the requests are extremely complex, but mostly because the kernel is just a useless proxy: it would be forced to do significant work to reason about what those requests do and when they should happen, but wouldn't be able to make those decisions itself so would have to just punt everything to userspace. Unless we have eBPF compositors.
An elaboration of how this differed from drm_syncobj would be really helpful here. I can make some guesses based on the rest of the mail, but I'm not sure how accurate they are.
This is also problematic. It's not just KMS, but media codecs too - V4L doesn't yet have explicit fencing, but given the programming model of codecs and how deeply they interoperate, it will.
Rather than client (GPU) -> compositor (GPU) -> compositor (KMS), imagine you're playing a Steam game on your Chromebook which you're streaming via Twitch or whatever. The full chain looks like:
* Steam game renders with GPU
* Xwayland in container receives dmabuf, forwards dmabuf to Wayland server (does not directly consume)
* Wayland server (which is actually Chromium) receives dmabuf, forwards dmabuf to Chromium UI process
* Chromium UI process forwards client dmabuf to KMS for direct scanout
* Chromium UI process _also_ forwards client dmabuf to GPU process
* Chromium GPU process composites Chromium UI + client dmabuf + webcam frame from V4L to GPU composition job
* Chromium GPU process forwards GPU composition dmabuf (not client dmabuf) to media codec for streaming
So, we don't have a 1:1 producer:consumer relationship. Even if we accept it's 1:n, your Chromebook is about to burst into flames and we're dropping frames to try to keep up. Some of the consumers are FIFO (the codec wants to push things through in order), and some of them are mailbox (the display wants to get the latest content, not from half a second ago before the other player started jumping around and now you're dead). You can't reason about any of these dependencies ahead of time from a producer PoV, because userspace will be making these decisions frame by frame. Also someone's started using the Vulkan present-timing extension because life wasn't confusing enough already.
As Christian and Daniel were getting at, there are also two 'levels' of explicit synchronisation.
The first (let's call it 'blind') is plumbing a dma_fence through to be passed with the dmabuf. When the client submits a buffer for presentation, it submits a dma_fence as well. When the compositor is finished with it (i.e. has flushed the last work which will source from that buffer), it passes a dma_fence back to the client, or no fence if none is required (buffer was never accessed, or all accesses are known to be fully retired, e.g. the last fence accessing it has already signaled). This is just a matter of typing, and is supported by at least Weston. It implies no scheduling change over implicit fencing in that the compositor can be held hostage by abusive clients with a really long compute shader in their dependency chain: all that's happening is that we're plumbing those synchronisation tokens through userspace instead of having the kernel dig them up from dma_resv. But we at least have a no-deadlock guarantee, because a dma_fence will complete in bounded time.
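As a trivial sketch of what the compositor can do with that token: in practice it would usually be handed straight to the compositor's own submission as an in-fence, but a sync_file fd can also be checked or waited on from the CPU via poll(), where POLLIN means the fence has signalled.

#include <poll.h>

/* Returns 1 if the sync_file has signalled within timeout_ms (0 = just check),
 * 0 otherwise. */
int fence_signalled(int sync_file_fd, int timeout_ms)
{
        struct pollfd pfd = {
                .fd = sync_file_fd,
                .events = POLLIN,
        };

        int ret = poll(&pfd, 1, timeout_ms);
        return ret > 0 && (pfd.revents & POLLIN);
}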
The second (let's call it 'smart') is ... much more than that. Not only does the compositor accept and generate explicit synchronisation points for the client, but those synchronisation points aren't dma_fences, but may be wait-before-signal, or may be wait-never-signal. So in order to avoid a terminal deadlock, the compositor has to sit on every synchronisation point and check before it flushes any dependent work that it has signaled, or will at least signal in bounded time. If that guarantee isn't there, you have to punt and see if anything happens at your next repaint point. We don't currently have this support in any compositor, and it's a lot more work than blind.
Given the interdependencies I've described above for Wayland - say a resize case, or when a surface commit triggers a cascade of subsurface commits - GPU-side conditional rendering is not always possible. In those cases, you _must_ do CPU-side waits and keep both sets of state around. Pain.
Typing all that out has convinced me that the current proposal is a net loss in every case.
Complex rendering uses (game engine with a billion draw calls, a billion BOs, complex sync dependencies, wait-before-signal and/or conditional rendering/descriptor indexing) don't need the complexity of a present ioctl and checking whether other processes have crashed or whatever. They already have everything plumbed through for this themselves, and need to implement so much infrastructure around it that they don't need much/any help from the kernel. Just give them a sync primitive with almost zero guarantees that they can map into CPU & GPU address space, let them go wild with it. drm_syncobj_plus_footgun. Good luck.
Simple presentation uses (desktop, browser, game) don't need the hyperoptimisation of sync primitives. Frame times are relatively long, and you can only have so many surfaces which aren't occluded. Either you have a complex scene to composite, in which case the CPU overhead of something like dma_fence is lower than the CPU overhead required to walk through a single compositor repaint cycle anyway, or you have a completely trivial scene to composite and you can absolutely eat the overhead of exporting and scheduling like two fences in 10ms.
Complex presentation uses (out-streaming, media sources, deeper presentation chains) make the trivial present ioctl so complex that its benefits evaporate. Wait-before-signal pushes so much complexity into the compositor that you have to eat a lot of CPU overhead there and lose your ability to do pipelined draws because you have to hang around and see if they'll ever complete. Cross-device usage means everyone just ends up spinning on the CPU instead.
So, can we take a step back? What are the problems we're trying to solve? If it's about optimising the game engine's internal rendering, how would that benefit from a present ioctl instead of current synchronisation?
If it's about composition, how do we balance the complexity between the kernel and userspace? What's the global benefit from throwing our hands in the air and saying 'you deal with it' to all of userspace, given that existing mailbox systems making frame-by-frame decisions already preclude deep/speculative pipelining on the client side?
Given that userspace then loses all ability to reason about presentation if wait-before-signal becomes a thing, do we end up with a global performance loss by replacing the overhead of kernel dma_fence handling with userspace spinning on a page? Even if we micro-optimise that by allowing userspace to be notified on access, is the overhead of pagefault -> kernel signal handler -> queue signalfd notification -> userspace event loop -> read page & compare to expected value, actually better than dma_fence?
Cheers, Daniel
It's still early in the morning here and I'm not awake yet so sorry if this comes out in bits and pieces...
On Tue, Apr 20, 2021 at 7:43 AM Daniel Stone daniel@fooishbar.org wrote:
IMO, there are two problems being solved here which are related in very subtle and tricky ways. They're also, admittedly, driver problems, not really winsys problems. Unfortunately, they may have winsys implications.
First, is better/real timelines for Vulkan and compute. With VK_KHR_timeline_semaphore, we introduced the timeline programming model to Vulkan. This is a massively better programming model for complex rendering apps which want to be doing all sorts of crazy. It comes with all the fun toys including wait-before-signal and no timeouts on any particular time points (a single command buffer may still time out). Unfortunately, the current implementation involves a lot of driver complexity, both in user space and kernel space. The "ideal" implementation for timelines (which is what Win10 does) is to have a trivial implementation where each timeline is a 64-bit integer living somewhere, clients signal whatever value they want, and you just throw the whole mess at the wall and hope the scheduler sorts it out. I'm going to call these "memory fences" rather than "userspace fences" because they could, in theory, be hidden entirely inside the kernel.
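For concreteness, this is what the timeline model already looks like at the Vulkan API level (a core Vulkan 1.2 sketch, error handling omitted); the question in this thread is what backs it, not what it looks like to the app.

#include <vulkan/vulkan.h>

/* One 64-bit counter per semaphore: signal a value from anywhere, wait on a
 * value from anywhere, including wait-before-signal. */
VkSemaphore create_timeline(VkDevice dev)
{
        VkSemaphoreTypeCreateInfo type_info = {
                .sType = VK_STRUCTURE_TYPE_SEMAPHORE_TYPE_CREATE_INFO,
                .semaphoreType = VK_SEMAPHORE_TYPE_TIMELINE,
                .initialValue = 0,
        };
        VkSemaphoreCreateInfo info = {
                .sType = VK_STRUCTURE_TYPE_SEMAPHORE_CREATE_INFO,
                .pNext = &type_info,
        };
        VkSemaphore sem = VK_NULL_HANDLE;

        vkCreateSemaphore(dev, &info, NULL, &sem);
        return sem;
}

VkResult wait_for_point(VkDevice dev, VkSemaphore sem, uint64_t point,
                        uint64_t timeout_ns)
{
        VkSemaphoreWaitInfo wait = {
                .sType = VK_STRUCTURE_TYPE_SEMAPHORE_WAIT_INFO,
                .semaphoreCount = 1,
                .pSemaphores = &sem,
                .pValues = &point,
        };

        return vkWaitSemaphores(dev, &wait, timeout_ns);
}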
We also want something like this for compute workloads. Not only because Vulkan and level0 provide this as part of their core API but because compute very much doesn't want dma-fence guarantees. You can, in theory, have a compute kernel sitting there running for hours and it should be ok assuming your scheduler can preempt and time-slice it with other stuff. This means that we can't ever have a long-running compute batch which triggers a dma-fence. We have to be able to trigger SOMETHING at the ends of those batches. What do we use? TBD but memory fences are the current proposal.
The second biting issue is that, in the current kernel implementation of dma-fence and dma_resv, we've lumped internal synchronization for memory management together with execution synchronization for userspace dependency tracking. And we have no way to tell the difference between the two internally. Even if user space is passing around sync_files and trying to do explicit sync, once you get inside the kernel, they're all dma-fences and it can't tell the difference. If we move to a more userspace-controlled synchronization model with wait-before-signal and no timeouts unless requested, regardless of the implementation, it plays really badly with dma-fence. And, by "badly" I mean the two are nearly incompatible. From a user space PoV, it means it's tricky to provide the finite time dma-fence guarantee. From a kernel PoV, it's way worse. Currently, the way dma-fence is constructed, it's impossible to deadlock assuming everyone follows the rules. The moment we allow user space to deadlock itself and allow those deadlocks to leak into the kernel, we have a problem. Even if we throw in some timeouts, we still have a scenario where user space has one linearizable dependency graph for execution synchronization and the kernel has a different linearizable dependency graph for memory management and, when you smash them together, you may have cycles in your graph.
So how do we sort this all out? Good question. It's a hard problem. Probably the hardest problem here is the second one: the intermixing of synchronization types. Solving that one is likely going to require some user space re-plumbing because all the user space APIs we have for explicit sync are built on dma-fence.
--Jason
Hi,
On Tue, 20 Apr 2021 at 16:46, Jason Ekstrand jason@jlekstrand.net wrote:
It's still early in the morning here and I'm not awake yet so sorry if this comes out in bits and pieces...
No problem, it's helpful. If I weren't on this thread I'd be attempting to put together a 73-piece chest of drawers whose instructions are about as clear as this so far, so I'm in the right head space anyway.
Yeah ... bingo.
First, is better/real timelines for Vulkan and compute. [...]
We also want something like this for compute workloads. [...]
Totally understand and agree with all of this. Memory fences seem like a good and useful primitive here.
Funny, because 'lumped [the two] together' is exactly the crux of my issues ...
If we move
Stop here, because ...
I would go further than that, and say completely, fundamentally, conceptually, incompatible.
Gotcha.
Firstly, let's stop, as you say, lumping things together. Timeline semaphores and compute's GPU-side spinlocks etc, are one thing. I accept those now have a hard requirement on something like memory fences, where any responsibility is totally abrogated. So let's run with that in our strawman: Vulkan compute & graphics & transfer queues all degenerate to something spinning (hopefully GPU-assisted gentle spin) on a uint64 somewhere. The kernel has (in the general case) no visibility or responsibility into these things. Fine - that's one side of the story.
But winsys is something _completely_ different. Yes, you're using the GPU to do things with buffers A, B, and C to produce buffer Z. Yes, you're using vkQueuePresentKHR to schedule that work. Yes, Mutter's composition job might depend on a Chromium composition job which depends on GTA's render job which depends on GTA's compute job which might take a year to complete. Mutter's composition job needs to complete in 'reasonable' (again, FSVO) time, no matter what. The two are compatible.
How? Don't lump them together. Isolate them aggressively, and _predictably_ in a way that you can reason about.
What clients do in their own process space is their own business. Games can deadlock themselves if they get wait-before-signal wrong. Compute jobs can run for a year. Their problem. Winsys is not that, because you're crossing every isolation boundary possible. Process, user, container, VM - every kind of privilege boundary. Thus far, dma_fence has protected us from the most egregious abuses by guaranteeing bounded-time completion; it also acts as a sequencing primitive, but from the perspective of a winsys person that's of secondary importance, which is probably one of the bigger disconnects between winsys people and GPU driver people.
Anyway, one of the great things about winsys (there are some! trust me) is we don't need to be as hopelessly general as for game engines, nor as hyperoptimised. We place strict demands on our clients, and we literally kill them every single time they get something wrong in a way that's visible to us. Our demands on the GPU are so embarrassingly simple that you can run every modern desktop environment on GPUs which don't have unified shaders. And on certain platforms which don't share tiling formats between texture/render-target/scanout ... and it all still runs fast enough that people don't complain.
We're happy to bear the pain of being the ones setting strict and unreasonable expectations. To me, this 'present ioctl' falls into the uncanny valley of the kernel trying to bear too much of the weight to be tractable, whilst not bearing enough of the weight to be useful for winsys.
So here's my principles for a counter-strawman:
Remove the 'return fence'. Burn it with fire, do not look back. Modern presentation pipelines are not necessarily 1:1, they are not necessarily FIFO (as opposed to mailbox), and they are not necessarily round-robin either. The current proposal provides no tangible benefits to modern userspace, and fixing that requires either hobbling userspace to remove capability and flexibility (ironic given that the motivation for this is all about userspace flexibility?), or pushing so much complexity into the kernel that we break it forever (you can't compile Mutter's per-frame decision tree into eBPF).
Give us a primitive representing work completion, so we can keep optimistically pipelining operations. We're happy to pass around explicit-synchronisation tokens (dma_fence, drm_syncobj, drm_newthing, whatever it is): plumbing through a sync token to synchronise compositor operations against client operations in both directions is just a matter of boring typing.
Make that primitive something that is every bit as usable across subsystems as it is across processes. It should be a lowest common denominator for middleware that ultimately provokes GPU execbuf, KMS commit, and media codec ops; currently that would be both wait and signal for all of VkSemaphore, EGLSyncKHR, KMS fence, V4L (D)QBUF, and VA-API {en,de}code ops. It must be exportable to and importable from an FD, which can be poll()ed on and read(). GPU-side visibility for late binding is nice, but not at all essential.
Make that primitive complete in 'reasonable' time, no matter what. There will always be failures in extremis, no matter what the design: absent hard-realtime principles from hardware all the way up to userspace, something will always be able to fail somewhere: non-terminating GPU work, actual GPU hang/reset, GPU queue DoSed, CPU scheduler, I/O DoSed. As long as the general case is bounded-time completion, each of these can be mitigated separately as long as userspace has enough visibility into the underlying mechanics, and cares enough to take meaningful action on it.
And something more concrete:
dma_fence.
This already has all of the properties described above. Kernel-wise, it already devolves to CPU-side signaling when it crosses device boundaries. We need to support it roughly forever since it's been plumbed so far and so wide. Any primitive which is acceptable for winsys-like usage which crosses so many device/subsystem/process/security boundaries has to meet the same requirements. So why reinvent something which looks so similar, and has the same requirements of the kernel babysitting completion, providing little to no benefit for that difference?
It's not usable for complex usecases, as we've established, but winsys is not that usecase. We can draw a hard boundary between the two worlds. For example, a client could submit an infinitely deep CS -> VS/FS/etc job chain with potentially-infinite completion, with the FS output being passed to the winsys for composition. Draw the line post-FS: export a dma_fence against FS completion. But instead of this being based on monitoring the _fence_ per se, base it on monitoring the job; if the final job doesn't retire in reasonable time, signal the fence and signal (like, SIGKILL, or just tear down the context and permanently -EIO, whatever) the client. Maybe for future hardware that would be the same thing - the kernel setting a timeout and comparing a read on a particular address against a particular value - but the 'present fence' proposal seems like it requires exactly this anyway.
That to me is the best compromise. We allow clients complete arbitrary flexibility, but as soon as they vkQueuePresentKHR, they're crossing a boundary out of happy fun GPU land and into strange hostile winsys land. We've got a lot of practice at being the bad guys who hate users and are always trying to ruin their dreams, so we'll happily wear the impact of continuing to do that. In doing so, we collectively don't have to invent a third new synchronisation primitive (to add to dma_fence and drm_syncobj) and a third new synchronisation model (implicit sync, explicit-but-bounded sync, explicit-and-maybe-unbounded sync) to support this, and we don't have to do an NT4 where GDI was shoved into the kernel.
It doesn't help with the goal of ridding dma_fence from the kernel, but it does very clearly segregate the two worlds. Drawing that hard boundary would allow drivers to hyperoptimise for clients which want to be extremely clever and agile and quick because they're sailing so close to the wind that they cannot bear the overhead of dma_fence, whilst also providing the guarantees we need when crossing isolation boundaries. In the latter case, the overhead of bouncing into a less-optimised primitive is totally acceptable because it's not even measurable: vkQueuePresentKHR requires client CPU activity -> kernel IPC -> compositor CPU activity -> wait for repaint cycle -> prepare scene -> composition, against which dma_fence overhead isn't and will never be measurable (even if it doesn't cross device/subsystem boundaries, which it probably does). And the converse for vkAcquireNextImageKHR.
tl;dr: we don't need to move winsys into the kernel, winsys and compute don't need to share sync primitives, the sync primitive at the client/winsys boundary does need strong and onerous guarantees, and that transition can be several orders of magnitude less efficient than intra-client sync primitives
Shoot me down. :)
Cheers, Daniel
Am 20.04.21 um 19:44 schrieb Daniel Stone:
Completely agree.
+1
Exactly, yes.
Finally somebody who understands me :)
Well the question then is how do we get winsys and your own process space together?
Ignoring everything below since that is the display pipeline I'm not really interested in. My concern is how to get the buffer from the client to the server without allowing the client to get the server into trouble?
My thinking is still to use timeouts to acquire texture locks. E.g. when the compositor needs to access a texture it grabs a lock, and if that lock isn't available in less than 20ms whoever is holding it is killed hard and the lock given to the compositor.
It's perfectly fine if a process has a hung queue, but if it tries to send buffers which should be filled by that queue to the compositor it just gets a corrupted window content.
Regards, Christian.
On Tue, 20 Apr 2021 at 19:00, Christian König < ckoenig.leichtzumerken@gmail.com> wrote:
It's a jarring transition. If you take a very narrow view and say 'it's all GPU work in shared buffers so it should all work the same', then client<->winsys looks the same as client<->client gbuffer. But this is a trap.
Just because you can mmap() a file on an NFS server in New Zealand doesn't mean that you should have the same expectations of memory access to that file as you do to of a pointer from alloca(). Even if the primitives look the same, you are crossing significant boundaries, and those do not come without a compromise and a penalty.
Kill the client hard. If the compositor has speculatively queued sampling against rendering which never completed, let it access garbage. You'll have one frame of garbage (outdated content, all black, random pattern; the failure mode is equally imperfect, because there is no perfect answer), then the compositor will notice the client has disappeared and remove all its resources.
It's not possible to completely prevent this situation if the compositor wants to speculatively pipeline work, only ameliorate it. From a system-global point of view, just expose the situation and let it bubble up. Watch the number of fences which failed to retire in time, and destroy the context if there are enough of them (maybe 1, maybe 100). Watch the number of contexts on a file description which get forcibly destroyed, and destroy the file description if there are enough of them. Watch the number of descriptions which get forcibly destroyed, and destroy the process if there are enough of them. Watch the number of processes in a cgroup/pidns which get forcibly destroyed, and destroy the ... etc. Whether it's the DRM driver or an external monitor such as systemd/Flatpak/podman/Docker doing that is pretty immaterial, as long as the concept of failure bubbling up remains.
(20ms is objectively the wrong answer FWIW, because we're not a hard RTOS. But if our biggest point of disagreement is 20 vs. 200 vs. 2000 vs. 20000 ms, then this thread has been a huge success!)
Cheers, Daniel
On Tue, Apr 20, 2021 at 8:16 PM Daniel Stone daniel@fooishbar.org wrote:
I think this is where we have a serious gap in what a winsys or a compositor is. Like if you have only a single wayland server running on a physical machine this is easy. But add a VR compositor, an intermediate compositor (say gamescope), Xwayland and some containers/VM, some video capture (or, gasp, a browser that doubles as compositor) and this story gets seriously complicated. Like who are you protecting from who? At what point is something client<->winsys vs. client<->client?
Hi,
On Tue, 20 Apr 2021 at 20:03, Bas Nieuwenhuizen bas@basnieuwenhuizen.nl wrote:
As I've said upthread, the line is _seriously_ blurred, and is only getting less clear. Right now, DRI3 cannot even accept a dma_fence, let alone a drm_syncobj, let alone a memory fence.
Crossing those boundaries is hard, and requires as much thinking as typing. That's a good thing.
Conflating every synchronisation desire into a single userspace-visible primitive makes this harder, because it treats game threads the same as other game threads the same as VR compositors the same as embedding browsers the same as compositors etc. Drawing very clear lines between game threads and the external world, with explicit weakening as necessary, makes those jarring transitions of privilege and expectation clear and explicit. Which is a good thing, since we're trying to move away from magic and implicit.
Cheers, Daniel
On Tue, Apr 20, 2021 at 7:45 PM Daniel Stone daniel@fooishbar.org wrote:
Yeah return fence for flips/presents sounds unappealing. Android tried it, we convinced them it's not great and they changed that.
So I can mostly get behind this, except it's _not_ going to be dma_fence. That thing has horrendous internal ordering constraints within the kernel, and the one thing that doesn't allow you is to make a dma_fence depend upon a userspace fence.
But what we can do is use the same currently existing container objects like drm_syncobj or sync_file (timeline syncobj would fit best tbh), and stuff a userspace fence behind it. The only trouble is that currently timeline syncobj implement vulkan's spec, which means if you build a wait-before-signal deadlock, you'll wait forever. Well until the user ragequits and kills your process.
So for winsys we'd need to be able to specify the wait timeout somewhere for waiting for that dma_fence to materialize (plus the submit thread, but userspace needs that anyway to support timeline syncobj) if you're importing an untrusted timeline syncobj. And I think that's roughly it.
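As a minimal sketch of what the compositor-side bounded wait could look like with the existing libdrm wrapper (the 100ms budget and the helper name are just examples; note the syncobj wait ioctls take an absolute CLOCK_MONOTONIC deadline in nanoseconds):

#include <stdint.h>
#include <time.h>
#include <xf86drm.h>

/* Bounded wait on one point of an imported, untrusted timeline syncobj.
 * WAIT_FOR_SUBMIT also covers the time until the fence materializes at all. */
static int wait_for_client_point(int drm_fd, uint32_t syncobj, uint64_t point)
{
	struct timespec now;
	uint32_t first_signaled;
	int64_t deadline_ns;

	clock_gettime(CLOCK_MONOTONIC, &now);
	deadline_ns = (int64_t)now.tv_sec * 1000000000ll + now.tv_nsec +
		      100 * 1000 * 1000; /* arbitrary 100 ms budget */

	return drmSyncobjTimelineWait(drm_fd, &syncobj, &point, 1, deadline_ns,
				      DRM_SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT,
				      &first_signaled);
}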
The fancy version would allow you to access the underlying memory fence from the cmd streamer and do fancy conditional rendering and fun stuff like that (pick old/new frame depending on which one is ready), but that's the fancy advanced compositor on top here. The "give me the same thing as I get with dma_fence implicit sync today" would just need the timeout for importing untrusted timeline syncobjs.
So a vk extension, and also probably a gl extension for timeline syncobj (not sure that exists already), which probably wants to specify the reasonable timeout limit by default. Because that's more the gl way of doing things.
Oh also I really don't want to support this for implicit sync, but heck we could even do that. It would stall pretty bad because there's no submit thread in userspace. But we could then optimize that with some new dma-buf ioctl to get out the syncobj, kinda like what Jason has already proposed for sync_file or so. And then userspace which has a submit thread could handle it correctly. -Daniel
Hi,
On Tue, 20 Apr 2021 at 19:54, Daniel Vetter daniel@ffwll.ch wrote:
Right. The only way you get to materialise a dma_fence from an execbuf is that you take a hard timeout, with a penalty for not meeting that timeout. When I say dma_fence I mean dma_fence, because there is no extant winsys support for drm_syncobj, so this is greenfield: the winsys gets to specify its terms of engagement, and again, we've been the orange/green-site enemies of users for quite some time already, so we're happy to continue doing so. If the actual underlying primitive is not a dma_fence, and compositors/protocol/clients need to eat a bunch of typing to deal with a different primitive which offers the same guarantees, then that's fine, as long as there is some tangible whole-of-system benefit.
How that timeout is actually realised is an implementation detail. Whether it's a property of the last GPU job itself that the CPU-side driver can observe, or that the kernel driver guarantees that there is a GPU job launched in parallel which monitors the memory-fence status and reports back through a mailbox/doorbell, or the CPU-side driver enqueues kqueue work for $n milliseconds' time to check the value in memory and kill the context if it doesn't meet expectations - whatever. I don't believe any of those choices meaningfully impact on kernel driver complexity relative to the initial proposal, but they do allow us to continue to provide the guarantees we do today when buffers cross security boundaries.
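As a very rough sketch of that last variant (everything with a driver_ prefix below, and the fence-value layout, is hypothetical; only the workqueue plumbing is the standard kernel API):

#include <linux/workqueue.h>
#include <linux/jiffies.h>

struct driver_context;
void driver_kill_context(struct driver_context *ctx);	/* hypothetical */

struct userspace_fence {
	u64 *value;			/* CPU-visible location the GPU writes */
	u64 wait_value;			/* value we expect to appear */
	struct delayed_work watchdog;
	struct driver_context *ctx;
};

static void fence_watchdog(struct work_struct *work)
{
	struct delayed_work *dwork = to_delayed_work(work);
	struct userspace_fence *f =
		container_of(dwork, struct userspace_fence, watchdog);

	if (READ_ONCE(*f->value) >= f->wait_value)
		return;			/* retired in time, nothing to do */

	/* Deadline missed: kill the context; never block here. */
	driver_kill_context(f->ctx);
}

static void arm_fence_watchdog(struct userspace_fence *f, unsigned int ms)
{
	INIT_DELAYED_WORK(&f->watchdog, fence_watchdog);
	schedule_delayed_work(&f->watchdog, msecs_to_jiffies(ms));
}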
There might well be an argument for significantly weakening those security boundaries and shifting the complexity from the DRM scheduler into userspace compositors. So far though, I have yet to see that argument made coherently.
Cheers, Daniel
On Tue, Apr 20, 2021 at 9:14 PM Daniel Stone daniel@fooishbar.org wrote:
So atm sync_file doesn't support future fences, but we could add the support for those there. And since vulkan doesn't really say anything about those, we could make the wait time out by default.
How that timeout is actually realised is an implementation detail. Whether it's a property of the last GPU job itself that the CPU-side driver can observe, or that the kernel driver guarantees that there is a GPU job launched in parallel which monitors the memory-fence status and reports back through a mailbox/doorbell, or the CPU-side driver enqueues kqueue work for $n milliseconds' time to check the value in memory and kill the context if it doesn't meet expectations - whatever. I don't believe any of those choices meaningfully impact on kernel driver complexity relative to the initial proposal, but they do allow us to continue to provide the guarantees we do today when buffers cross security boundaries.
The thing is, you can't do this in drm/scheduler. At least not without splitting up the dma_fence in the kernel into separate memory fences and sync fences, and the work to get there is imo just not worth it. We've bikeshedded this ad nauseam for vk timeline syncobj, and the solution was to have the submit thread in the userspace driver.
It won't really change anything wrt what applications can observe from the egl/gl side of things though.
There might well be an argument for significantly weakening those security boundaries and shifting the complexity from the DRM scheduler into userspace compositors. So far though, I have yet to see that argument made coherently.
Ah we've had that argument. We have moved that into userspace as part of vk submit threads. It aint pretty, but it's better than the other option :-) -Daniel
Hi,
On Tue, 20 Apr 2021 at 20:30, Daniel Vetter daniel@ffwll.ch wrote:
I'm starting to think this thread needs its own glossary ...
I propose we use 'residency fence' for execution fences which enact memory-residency operations, e.g. faulting in a page ultimately depending on GPU work retiring.
And 'value fence' for the pure-userspace model suggested by timeline semaphores, i.e. fences being (*addr == val) rather than being able to look at ctx seqno.
Cheers, Daniel
Thanks everybody. The initial proposal is dead. Here are some thoughts on how to do it differently.
I think we can have direct command submission from userspace via memory-mapped queues ("user queues") without changing window systems.
The memory management doesn't have to use GPU page faults like HMM. Instead, it can wait for user queues of a specific process to go idle and then unmap the queues, so that userspace can't submit anything. Buffer evictions, pinning, etc. can be executed when all queues are unmapped (suspended). Thus, no BO fences and page faults are needed.
Inter-process synchronization can use timeline semaphores. Userspace will query the wait and signal value for a shared buffer from the kernel. The kernel will keep a history of those queries to know which process is responsible for signalling which buffer. There is only the wait-timeout issue and the question of how to identify the culprit. One of the solutions is to have the GPU send all GPU signal commands and all timed-out wait commands via an interrupt to the kernel driver, so that it can monitor and validate userspace behavior. With that, the kernel can identify whether the culprit is the waiting process or the signalling process. Invalid signal/wait parameters can also be detected. The kernel can force-signal only the semaphores that time out, and punish the processes which caused the timeout or used invalid signal/wait parameters.
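As a purely hypothetical sketch of that validation idea (every name below is invented for illustration; nothing is an existing interface):

#include <linux/types.h>

struct kernel_sem_history;	/* per-buffer wait/signal history, hypothetical */

/* Hypothetical helpers operating on that history. */
void force_signal(struct kernel_sem_history *hist, u64 bo_id, u64 value);
void punish_signaller(struct kernel_sem_history *hist, u64 bo_id, u64 value);
bool history_matches(struct kernel_sem_history *hist, u64 bo_id, u64 value);

struct sem_event {
	u64 bo_id;
	u64 value;
	bool is_timeout;	/* timed-out wait vs. completed signal */
};

/* Called from the interrupt handler for every signal command and every
 * timed-out wait command the GPU reports. */
static void handle_sem_event(struct kernel_sem_history *hist,
			     struct sem_event *ev)
{
	if (ev->is_timeout) {
		/* Force-signal only this semaphore and punish whoever the
		 * history says should have signalled it. */
		force_signal(hist, ev->bo_id, ev->value);
		punish_signaller(hist, ev->bo_id, ev->value);
	} else if (!history_matches(hist, ev->bo_id, ev->value)) {
		/* Invalid signal parameters: punish the signalling process. */
		punish_signaller(hist, ev->bo_id, ev->value);
	}
}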
The question is whether this synchronization solution is robust enough for dma_fence and whatever the kernel and window systems need.
Marek
On Tue, Apr 20, 2021 at 4:34 PM Daniel Stone daniel@fooishbar.org wrote:
On Mon, Apr 26, 2021 at 04:59:28PM -0400, Marek Olšák wrote:
The proper model here is the preempt-ctx dma_fence that amdkfd uses (without page faults). That means dma_fence for synchronization is doa, at least as-is, and we're back to figuring out the winsys problem.
"We'll solve it with timeouts" is very tempting, but doesn't work. It's akin to saying that we're solving deadlock issues in a locking design by doing a global s/mutex_lock/mutex_lock_timeout/ in the kernel. Sure it avoids having to reach the reset button, but that's about it.
And the fundamental problem is that once you throw in userspace command submission (and syncing, at least within the userspace driver, otherwise there's kinda no point if you still need the kernel for cross-engine sync), you get deadlocks if you still use dma_fence for sync under perfectly legit use-cases. We've discussed that one ad nauseam last summer:
https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_f...
See the silly diagram at the bottom.
Now I think all isn't lost, because imo the first step to getting to this brave new world is rebuilding the driver on top of userspace fences, and with the adjusted cmd submit model. You probably don't want to use amdkfd, but port that as a context flag or similar to render nodes for gl/vk. Of course that means you can only use this mode in headless, without glx/wayland winsys support, but it's a start. -Daniel
If we don't use future fences for DMA fences at all, e.g. we don't use them for memory management, it can work, right? Memory management can suspend user queues anytime. It doesn't need to use DMA fences. There might be something that I'm missing here.
What would we lose without DMA fences? Just inter-device synchronization? I think that might be acceptable.
The only case when the kernel will wait on a future fence is before a page flip. Everything today already depends on userspace not hanging the gpu, which makes everything a future fence.
Marek
On Tue., Apr. 27, 2021, 04:02 Daniel Vetter, daniel@ffwll.ch wrote:
Correct, we wouldn't have synchronization between device with and without user queues any more.
That could only be a problem for A+I Laptops.
Memory management will just work with preemption fences which pause the user queues of a process before evicting something. That will be a dma_fence, but also a well known approach.
Christian.
Am 27.04.21 um 13:49 schrieb Marek Olšák:
On Tue, Apr 27, 2021 at 2:11 PM Marek Olšák maraeo@gmail.com wrote:
Ok. I'll interpret this as "yes, it will work, let's do it".
It works if all you care about is drm/amdgpu. I'm not sure that's a reasonable approach for upstream, but it definitely is an approach :-)
We've already gone somewhat through the pain of drm/amdgpu redefining how implicit sync works without sufficiently talking with other people, maybe we should avoid a repeat of this ... -Daniel
Am 27.04.21 um 14:15 schrieb Daniel Vetter:
BTW: This is coming up again for the plan here.
We once more need to think about the "other" fences which don't participate in the implicit sync here.
Christian.
I'll defer to Christian and Alex to decide whether dropping sync with non-amd devices (GPUs, cameras etc.) is acceptable.
Rewriting those drivers to this new sync model could be done on a case by case basis.
For now, would we only lose the "amd -> external" dependency? Or the "external -> amd" dependency too?
Marek
On Tue., Apr. 27, 2021, 08:15 Daniel Vetter, daniel@ffwll.ch wrote:
Only amd -> external.
We can easily install something in a user queue which waits for a dma_fence in the kernel.
But we can't easily wait for a user queue as a dependency of a dma_fence.
The good thing is we have this wait before signal case on Vulkan timeline semaphores which have the same problem in the kernel.
The good news is I think we can relatively easily convert i915 and older amdgpu devices to something which is compatible with user fences.
So yes, getting that fixed case by case should work.
Christian
Am 27.04.21 um 14:46 schrieb Marek Olšák:
Ok. So that would only make the following use cases broken for now:
- amd render -> external gpu
- amd video encode -> network device
What about the case when we get a buffer from an external device and we're supposed to make it "busy" when we are using it, and the external device wants to wait until we stop using it? Is it something that can happen, thus turning "external -> amd" into "external <-> amd"?
Marek
On Tue., Apr. 27, 2021, 08:50 Christian König, < ckoenig.leichtzumerken@gmail.com> wrote:
Uff good question. DMA-buf certainly supports that use case, but I have no idea if that is actually used somewhere.
Daniel do you know any case?
Christian.
Am 27.04.21 um 15:26 schrieb Marek Olšák:
Hi,
Am Dienstag, dem 27.04.2021 um 09:26 -0400 schrieb Marek Olšák:
FWIW, "only" breaking amd render -> external gpu will make us pretty unhappy, as we have some cases where we are combining an AMD APU with a FPGA based graphics card. I can't go into the specifics of this use- case too much but basically the AMD graphics is rendering content that gets composited on top of a live video pipeline running through the FPGA.
Zero-copy texture sampling from a video input certainly appreciates this very much. Trying to pass the render fence through the various layers of userspace to be able to tell when the video input can reuse a buffer is a great experience in yak shaving. Allowing the video input to reuse the buffer as soon as the read dma_fence from the GPU is signaled is much more straight forward.
Regards, Lucas
On Tuesday, April 27th, 2021 at 7:31 PM, Lucas Stach l.stach@pengutronix.de wrote:
I concur. I have quite a few users with a multi-GPU setup involving AMD hardware.
Note, if this brokenness can't be avoided, I'd prefer to get a clear error rather than bad results on screen because nothing is synchronized anymore.
On Tue, Apr 27, 2021 at 1:35 PM Simon Ser contact@emersion.fr wrote:
It's an upcoming requirement for Windows [1], so you are likely to start seeing this across all GPU vendors that support Windows. I think the timing depends on how long legacy hardware support sticks around for each vendor.
Alex
[1] - https://devblogs.microsoft.com/directx/hardware-accelerated-gpu-scheduling/
On Tuesday, April 27th, 2021 at 8:01 PM, Alex Deucher alexdeucher@gmail.com wrote:
Hm, okay.
Will using the existing explicit synchronization APIs make it work properly? (e.g. IN_FENCE_FD + OUT_FENCE_PTR in KMS, EGL_KHR_fence_sync + EGL_ANDROID_native_fence_sync + EGL_KHR_wait_sync in EGL)
On Tue, Apr 27, 2021 at 06:27:27PM +0000, Simon Ser wrote:
If you have hw which really _only_ supports userspace direct submission (i.e. the ringbuffer has to be in the same gpu vm as everything else by design, and can't be protected at all with e.g. read-only pte entries) then all that stuff would be broken. -Daniel
On Tue, Apr 27, 2021 at 02:01:20PM -0400, Alex Deucher wrote:
Yeah but hw scheduling doesn't mean the hw has to be constructed to not support isolating the ringbuffer at all.
E.g. even if the hw loses the bit to put the ringbuffer outside of the userspace gpu vm, if you have pagetables I'm seriously hoping you have r/o pte flags. Otherwise the entire "share address space with cpu side, seamlessly" thing is out of the window.
And with that r/o bit on the ringbuffer you can once more force submit through kernel space, and all the legacy dma_fence based stuff keeps working. And we don't have to invent some horrendous userspace fence based implicit sync mechanism in the kernel, but can instead do this transition properly with drm_syncobj timeline explicit sync and protocol revving.
At least I think you'd have to work extra hard to create a gpu which cannot possibly be intercepted by the kernel, even when it's designed to support userspace direct submit only.
Or are your hw engineers more creative here and we're screwed? -Daniel
Am 28.04.21 um 12:05 schrieb Daniel Vetter:
The upcoming hardware generation will have this hardware scheduler as a must-have, but there are certain ways we can still stick to the old approach:
1. The new hardware scheduler currently still supports kernel queues which essentially is the same as the old hardware ring buffer.
2. Mapping the top level ring buffer into the VM at least partially solves the problem. This way you can't manipulate the ring buffer content, but the location for the fence must still be writeable.
For now and for the next hardware generation we are safe to support the old submission model, but the functionality of kernel queues will sooner or later go away if Linux is its only user.
So we need to work on something which works in the long term and get us away from this implicit sync.
Christian.
-Daniel
On Wed, Apr 28, 2021 at 12:31:09PM +0200, Christian König wrote:
Yeah allowing userspace to lie about completion fences in this model is ok. Though I haven't thought through full consequences of that, but I think it's not any worse than userspace lying about which buffers/address it uses in the current model - we rely on hw vm ptes to catch that stuff.
Also it might be good to switch to a non-recoverable ctx model for these. That's already what we do in i915 (opt-in, but all current umds use that mode). So any hang/watchdog just kills the entire ctx and you don't have to worry about userspace doing something funny with its ringbuffer. Simplifies everything.
Also ofc userspace fencing is still disallowed, but since userspace would queue up all writes to its ringbuffer through the drm/scheduler, we'd handle dependencies through that still. Not great, but workable.
Thinking about this, not even mapping the ringbuffer r/o is required, it's just that we must queue things through the kernel to resolve dependencies and everything without breaking dma_fence. If userspace lies, tdr will shoot it and the kernel stops running that context entirely.
So I think even if we have hw with 100% userspace submit model only we should be still fine. It's ofc silly, because instead of using userspace fences and gpu semaphores the hw scheduler understands we still take the detour through drm/scheduler, but at least it's not a break-the-world event.
Or do I miss something here?
Yeah I think we have pretty clear consensus on that goal, just no one has yet volunteered to get going with the winsys/wayland work to plumb drm_syncobj through, and the kernel/mesa work to make that optionally a userspace fence underneath. And it's for sure a lot of work. -Daniel
On Wed, Apr 28, 2021 at 02:21:54PM +0200, Daniel Vetter wrote:
Also no page fault support, userptr invalidates still stall until end-of-batch instead of just preempting it, and all that too. But I mean there needs to be some motivation to fix this and roll out explicit sync :-) -Daniel
Am 28.04.21 um 14:26 schrieb Daniel Vetter:
Thinking more about that approach I don't think that it will work correctly.
See we not only need to write the fence as signal that an IB is submitted, but also adjust a bunch of privileged hardware registers.
If userspace could do that from its IBs as well, then there would be nothing blocking it from reprogramming the page table base address, for example.
We could do those writes with the CPU as well, but that would be a huge performance drop because of the additional latency.
Christian.
On Wed, Apr 28, 2021 at 03:11:27PM +0200, Christian König wrote:
That's not what I'm suggesting. I'm suggesting you have the queue and everything in userspace, like on Windows. Fences are handled exactly like on Windows too. The difference is:
- All new additions to the ringbuffer are done through a kernel ioctl call, using the drm/scheduler to resolve dependencies.
- Memory management is also done like today in that ioctl.
- TDR makes sure that if userspace abuses the contract (which it can, but it can do that already today because there's also no command parser to e.g. stop gpu semaphores) the entire context is shot and terminally killed. Userspace has to then set up a new one. This isn't how amdgpu recovery works right now, but i915 supports it and I think it's also the better model for userspace error recovery anyway.
So from hw pov this will look _exactly_ like windows, except we never page fault.
From sw pov this will look _exactly_ like current kernel ringbuf model,
with exactly same dma_fence semantics. If userspace lies, does something stupid or otherwise breaks the uapi contract, vm ptes stop invalid access and tdr kills it if it takes too long.
Where do you need privileged IB writes or anything like that?
Ofc kernel needs to have some safety checks in the dma_fence timeline that relies on userspace ringbuffer to never go backwards or unsignal, but that's kinda just more compat cruft to make the kernel/dma_fence path work.
Cheers, Daniel
On Wed, Apr 28, 2021 at 03:37:49PM +0200, Christian König wrote:
I'm confused. How does this work on Windows then with pure userspace submit? Windows userspace sets its priorities and vm registers itself from userspace? -Daniel
Am 28.04.21 um 16:34 schrieb Daniel Vetter:
The priorities and VM registers are setup from the hw scheduler on windows, but this comes with preemption again.
And just letting the kernel write to the ring buffer has the same problems as userspace fences. E.g. userspace could just overwrite the commands which write the fence value with NOPs.
In other words we certainly need some kind of protection for the ring buffer, e.g. setting it readonly and making sure that it can always write the fence and is never preempted by the HW scheduler. But that protection breaks our neck at different places again.
That solution could maybe work, but it is certainly not something we have tested.
Christian.
-Daniel
On Wed, Apr 28, 2021 at 04:45:01PM +0200, Christian König wrote:
The thing is, if the hw scheduler preempts your stuff a bit occasionally, what's the problem? Essentially it just looks like each context is its own queue that can make forward progress.
Also, I'm assuming there's some way in the Windows model to make sure that unprivileged userspace can't change the vm registers and priorities itself. Those work the same in both worlds.
My point is: You don't need protection. With the current cs ioctl userspace can already do all kinds of nasty stuff and break itself. gpu pagetables and TDR make sure nothing bad happens.
So imo you don't actually need to protect anything in the ring, as long as you don't bother supporting recoverable TDR. You just declare the entire ring shot and ask userspace to set up a new one.
That solution could maybe work, but it is certainly not something we have tested.
Again, you can run the entire hw like on Windows. The only thing you add on top is that new stuff gets added to the userspace ring through the kernel, so that the kernel can make sure all the resulting dma_fences are still properly ordered, won't deadlock and will complete in due time (using TDR). Also, the entire memory management works like now, but from a hw point of view that's also not different. It just means that page faults are never fixed, but the response is always that there's really no page present at that slot.
I'm really not seeing the fundamental problem, nor why exactly you need a completely different hw model here. -Daniel
On Wed, Apr 28, 2021 at 10:35 AM Daniel Vetter daniel@ffwll.ch wrote:
When the user allocates usermode queues, the kernel driver sets up a queue descriptor in the kernel which defines the location of the queue in memory, what priority it has, what page tables it should use, etc. User mode can then start writing commands to its queues. When they are ready for the hardware to start executing them, they ring a doorbell which signals the scheduler and it maps the queue descriptors to HW queue slots and they start executing. The user only has access to its queues and any buffers it has mapped in its GPU virtual address space. While the queues are scheduled, the user can keep submitting work to them and they will keep executing unless they get preempted by the scheduler due to oversubscription or a priority call or a request from the kernel driver to preempt, etc.
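A hypothetical sketch of the per-queue state the kernel records at queue-create time, just to make the description above concrete (field names are invented, this is not real amdgpu uapi):

#include <linux/types.h>

struct user_queue_desc {
	__u64 ring_gpu_va;	/* ring buffer location in the process's GPU VM */
	__u32 ring_size;
	__u32 priority;
	__u64 rptr_gpu_va;	/* read/write pointers also live in user memory */
	__u64 wptr_gpu_va;
	__u64 doorbell_offset;	/* page within the doorbell aperture */
	__u32 pasid;		/* which page tables the queue runs under */
	__u32 pad;
};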
Alex
On Wed, Apr 28, 2021 at 04:39:24PM -0400, Alex Deucher wrote:
Yeah, works like with our stuff.
I don't see a problem tbh. It's slightly silly going the detour with the kernel ioctl, and it's annoying that you still have to use drm/scheduler to resolve dependencies instead of gpu semaphores and all that. But this only applies to legacy winsys mode, compute (e.g. vk without winsys) can use the full power. Just needs a flag or something when setting up the context.
And best part is that from hw pov this really is indistinguishable from the full on userspace submit model.
The thing where it gets annoying is when you use one of these new cpu instructions which do direct submit to hw and pass along the pasid id behind the scenes. That's truly something you can't intercept anymore in the kernel and fake the legacy dma_fence world.
But what you're describing here sounds like bog standard stuff, and also pretty easy to keep working with exactly the current model.
Ofc we'll want to push forward a more modern model that better suits modern gpus, but I don't see any hard requirement here from the hw side.
Cheers, Daniel
On Thu, Apr 29, 2021 at 1:12 PM Daniel Vetter daniel@ffwll.ch wrote:
Adding a bit more detail on what I have in mind:
- memory management works like amdgpu does today, so all buffers are pre-bound to the gpu vm, we keep the entire bo set marked as busy with the bulk lru trick for every command submission.
- for the ringbuffer, userspace allocates a suitably sized bo for the ringbuffer, ring/tail/seqno and whatever else it needs
- userspace then asks the kernel to make that into a hw context, with all the privileges set up. The doorbell will only be mapped into the kernel (hw can't tell the difference anyway), but if it happens to also be visible to userspace that's no problem. We assume userspace can ring the doorbell anytime it wants to.
- we do double memory management: One dma_fence works similar to the amdkfd preempt fence, except it doesn't preempt but does anything required to make the hw context unrunnable and take it out of the hw scheduler entirely. This might involve unmapping the doorbell if userspace has access to it.
- but we also do classic end-of-batch fences, so that implicit fencing and all that keeps working. The "make hw ctx unrunnable" fence must also wait for all of these pending submissions to complete.
- for the actual end-of-batchbuffer dma_fence it's almost all faked, but with some checks in the kernel to keep up the guarantees. cs flow is roughly
1. userspace directly writes into the userspace ringbuffer. It needs to follow the kernel's rule for this if it wants things to work correctly, but we assume evil userspace is allowed to write whatever it wants to the ring, and change that whenever it wants. Userspace does not update ring head/tail pointers.
2. cs ioctl just contains: a) the head pointer value to write to kick off this new batch (the thing userspace advances; the tail is where the gpu consumes), b) in-fences, c) out-fence.
3. kernel drm/scheduler handles this like any other request and first waits for the in-fences to all signal, then it executes the CS. For execution it simply writes the provided head value into the ring's metadata, and rings the doorbells. No checks. We assume userspace can update the tail whenever it feels like, so checking the head value is pointless anyway.
4. the entire correctness depends only upon the dma_fences working as they should. For that we need some very strict rules on when the end-of-batchbuffer dma_fence signals:
- the drm/scheduler must have marked the request as runnable already, i.e. all dependencies are fulfilled. This is to prevent the fences from signalling in the wrong order.
- the fence from the previous batch must have signalled already, again to guarantee in-order signalling (even if userspace does something stupid and reorders how things complete)
- the fence must never jump back to unsignalled, so the lockless fastpath that just checks the seqno is a no-go
5. if drm/scheduler tdr decides it's taking too long we throw the entire context away, forbid further command submission on it (through the ioctl, userspace can keep writing to the ring whatever it wants) and fail all in-flight buffers with an error. Non-evil userspace can then recover by re-creating a new ringbuffer with everything.
I've pondered this now for a bit and I really can't spot the holes. And I think it should all work, both for hw and kernel/legacy dma_fence use-case. -Daniel
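For point 4, a rough sketch of what the signalling-side checks could look like (the fence struct and where the seqno lives are hypothetical; dma_fence_is_signaled()/dma_fence_signal() are the real helpers):

#include <linux/dma-fence.h>

struct batch_fence {
	struct dma_fence base;
	struct dma_fence *prev;		/* fence of the previous batch */
	u64 seqno_value;		/* value userspace is supposed to write */
	u64 *seqno_memory;		/* location in the (untrusted) ring BO */
};

/* Called from kernel context (irq/worker), never as a lockless fastpath. */
static void maybe_signal_batch_fence(struct batch_fence *f)
{
	/* Never signal out of order, and never trust the memory value alone. */
	if (f->prev && !dma_fence_is_signaled(f->prev))
		return;
	if (READ_ONCE(*f->seqno_memory) < f->seqno_value)
		return;

	/* dma_fence_signal() is sticky: once signalled, never unsignalled. */
	dma_fence_signal(&f->base);
}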
Am 30.04.21 um 10:58 schrieb Daniel Vetter:
This doesn't work in hardware. We at least need to set up a few registers and memory locations from inside the VM which userspace shouldn't have access to when we want the end of batch fence and ring buffer start to be reliable.
This together doesn't work from the software side, e.g. you can either have preemption fences or end of batch fences but never both or your end of batch fences would have another dependency on the preemption fences which we currently can't express in the dma_fence framework.
Additionally, it can't work from the hardware side because we have a separation between engine and scheduler there. So we can't reliably get a signal inside the kernel that a batch has completed.
What we could do is to get this signal in userspace, e.g. userspace inserts the packets into the ring buffer and then the kernel can read the fence value and get the IV.
But this has the same problem as user fences because it requires the cooperation of userspace.
We just yesterday had a meeting with the firmware developers to discuss the possible options and I now have even stronger doubts that this is doable.
We either have user queues where userspace writes the necessary commands directly to the ring buffer or we have kernel queues. A mixture of both isn't supported by either the hardware or the firmware.
Regards, Christian.
On Fri, Apr 30, 2021 at 11:08 AM Christian König ckoenig.leichtzumerken@gmail.com wrote:
The thing is, we don't care whether it's reliable or not. Userspace is allowed to lie, not signal, signal the wrong thing, out of order, everything.
The design assumes all this is possible.
So unless you can't signal at all from userspace, this works. And for the "can't signal at all" it just means something needs to do a cpu busy wait and burn down lots of cpu time. I hope that's not your hw design :-)
It's _not_ a preempt fence. It's a ctx unload fence. Not the same thing. A normal preempt fence would indeed fail.
Nope. Read the thing again, I'm assuming that userspace lies. The kernel's dma_fence code compensates for that.
Also note that userspace can already lie to its heart's content with the current IB stuff. You are already allowed to hang the gpu, submit utter garbage, render to the wrong buffer or just scribble all over your own IB. This isn't a new problem.
Yup. Please read my thing again carefully, I'm stating that userspace writes all the necessary commands directly into the ringbuffer.
The kernel writes _nothing_ into the ringbuffer. The only thing it does is update the head pointer to unblock that next section of the ring, when drm/scheduler thinks that's ok to do.
This works, you just thinking of something completely different than what I write down :-)
Cheers, Daniel
Hi,
On Fri, 30 Apr 2021 at 10:35, Daniel Vetter daniel@ffwll.ch wrote:
I've been sitting this one out so far because what other-Dan's proposed seems totally sensible and workable for me, so I'll let him argue it rather than confuse it.
But - yes. Our threat model does not care about a malicious client which deliberately submits garbage and then gets the compositor to display garbage. If that's the attack then you could just emit noise from your frag shader.
Cheers, Daniel
On Wednesday, April 28th, 2021 at 2:21 PM, Daniel Vetter daniel@ffwll.ch wrote:
I'm interested in helping with the winsys/wayland bits, assuming the following:
- We are pretty confident that drm_syncobj won't be superseded by something else in the near future. It seems to me like a lot of effort has gone into plumbing sync_file stuff all over, and it already needs replacing (I mean, it'll keep working, but we have a better replacement now. So compositors which have decided to ignore explicit sync for all this time won't have to do the work twice.)
- Plumbing drm_syncobj solves the synchronization issues with upcoming AMD hardware, and all of this works fine in cross-vendor multi-GPU setups.
- Someone is willing to spend a bit of time bearing with me and explaining how this all works. (I only know about sync_file for now, I'll start reading the Vulkan bits.)
Are these points something we can agree on?
Thanks,
Simon
On Wed, Apr 28, 2021 at 6:31 AM Christian König ckoenig.leichtzumerken@gmail.com wrote:
Even if it doesn't go away completely, no one else will be using it. This leaves a lot of under-validated execution paths that lead to subtle bugs. When everyone else moved to KIQ for queue management, we stuck with MMIO for a while in Linux and we ran into tons of subtle bugs that disappeared when we moved to KIQ. There were lots of assumptions about how software would (or wouldn't) use different firmware interfaces, which impacted lots of interactions with clock and power gating, to name a few. On top of that, you need to use the scheduler to utilize stuff like preemption properly. Also, if you want to do stuff like gang scheduling (UMD scheduling multiple queues together), it's really hard to do with kernel software schedulers.
Alex
Trying to figure out which e-mail in this mess is the right one to reply to....
On Tue, Apr 27, 2021 at 12:31 PM Lucas Stach l.stach@pengutronix.de wrote:
Assuming said external GPU doesn't support memory fences. If we do amdgpu and i915 at the same time, that covers basically most of the external GPU use-cases. Of course, we'd want to convert nouveau as well for the rest.
I think it's worth taking a step back and asking what's being proposed here before we freak out too much. If we do go this route, it doesn't mean that your FPGA use-case can't work, it just means it won't work out-of-the-box anymore. You'll have to separate execution and memory dependencies inside your FPGA driver. That's still not great but it's not as bad as you maybe made it sound.
Oh, it's definitely worse than that. Every window system interaction is bi-directional. The X server has to wait on the client before compositing from it and the client has to wait on X before re-using that back-buffer. Of course, we can break that latter dependency by doing a full CPU wait but that's going to mean either more latency or reserving more back buffers. There's no good clean way to claim that any of this is one-directional.
--Jason
Jason, both memory-based signalling as well as interrupt-based signalling to the CPU would be supported by amdgpu. External devices don't need to support memory-based sync objects. The only limitation is that they can't convert amdgpu sync objects to dma_fence.
The sad thing is that "external -> amdgpu" dependencies are really "external <-> amdgpu" dependencies due to mutually-exclusive access required by non-explicitly-sync'd buffers, so amdgpu-amdgpu interop is the only interop that would initially work with those buffers. Explicitly sync'd buffers also won't work if other drivers convert explicit fences to dma_fence. Thus, both implicit sync and explicit sync might not work with other drivers at all. The only interop that would initially work is explicit fences with memory-based waiting and signalling on the external device to keep the kernel out of the picture.
Marek
On Tue, Apr 27, 2021 at 3:41 PM Jason Ekstrand jason@jlekstrand.net wrote:
On Tue, Apr 27, 2021 at 4:59 PM Marek Olšák maraeo@gmail.com wrote:
Jason, both memory-based signalling as well as interrupt-based signalling to the CPU would be supported by amdgpu. External devices don't need to support memory-based sync objects. The only limitation is that they can't convert amdgpu sync objects to dma_fence.
Sure. I'm not worried about the mechanism. We just need a word that means "the new fence thing" and I've been throwing "memory fence" around for that. Other mechanisms may work as well.
The sad thing is that "external -> amdgpu" dependencies are really "external <-> amdgpu" dependencies due to mutually-exclusive access required by non-explicitly-sync'd buffers, so amdgpu-amdgpu interop is the only interop that would initially work with those buffers. Explicitly sync'd buffers also won't work if other drivers convert explicit fences to dma_fence. Thus, both implicit sync and explicit sync might not work with other drivers at all. The only interop that would initially work is explicit fences with memory-based waiting and signalling on the external device to keep the kernel out of the picture.
Yup. This is where things get hard. That said, I'm not quite ready to give up on memory/interrupt fences just yet.
One thought that came to mind which might help would be if we added an extremely strict concept of memory ownership. The idea would be that any given BO would be in one of two states at any given time:
1. legacy: dma_fences and implicit sync works as normal but it cannot be resident in any "modern" (direct submission, ULLS, whatever you want to call it) context
2. modern: In this mode they should not be used by any legacy context. We can't strictly prevent this, unfortunately, but maybe we can say reading produces garbage and writes may be discarded. In this mode, they can be bound to modern contexts.
In theory, when in "modern" mode, you could bind the same buffer in multiple modern contexts at a time. However, when that's the case, it makes ownership really tricky to track. Therefore, we might want some sort of dma-buf create flag for "always modern" vs. "switchable" and only allow binding to one modern context at a time when it's switchable.
If we did this, we may be able to move any dma_fence shenanigans to the ownership transition points. We'd still need some sort of "wait for fence and transition" which has a timeout. However, then we'd be fairly well guaranteed that the application (not just Mesa!) has really and truly decided it's done with the buffer and we wouldn't (I hope!) end up with the accidental edges in the dependency graph.
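A minimal sketch of what that transition point could look like (the enum/struct names and the 2-second budget are invented; dma_fence_wait_timeout() is the real helper):

#include <linux/dma-fence.h>
#include <linux/jiffies.h>
#include <linux/errno.h>

enum bo_owner_mode { BO_MODE_LEGACY, BO_MODE_MODERN };

struct owned_bo {
	enum bo_owner_mode mode;
	/* ... driver BO state ... */
};

/* Move a buffer from legacy (dma_fence/implicit sync) ownership into a
 * modern user-submission context. The only dma_fence wait happens here,
 * at the transition point, and it is bounded. */
static int bo_transition_to_modern(struct owned_bo *bo,
				   struct dma_fence *last_legacy_use)
{
	long ret;

	ret = dma_fence_wait_timeout(last_legacy_use, true,
				     msecs_to_jiffies(2000) /* example */);
	if (ret == 0)
		return -ETIMEDOUT;	/* still busy: caller decides what to do */
	if (ret < 0)
		return ret;		/* interrupted */

	bo->mode = BO_MODE_MODERN;
	return 0;
}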
Of course, I've not yet proven any of this correct so feel free to tell me why it won't work. :-) It was just one of those "about to go to bed and had a thunk" type thoughts.
--Jason
P.S. Daniel was 100% right when he said this discussion needs a glossary.
On Wed., Apr. 28, 2021, 00:01 Jason Ekstrand, jason@jlekstrand.net wrote:
We'd like to keep userspace outside of Mesa drivers intact and working except for interop where we don't have much choice. At the same time, future hw may remove support for kernel queues, so we might not have much choice there either, depending on what the hw interface will look like.
The idea is to have an ioctl for querying a timeline semaphore buffer associated with a shared BO, and an ioctl for querying the next wait and signal number (e.g. n and n+1) for that semaphore. Waiting for n would be like mutex lock and signaling would be like mutex unlock. The next process would use the same ioctl and get n+1 and n+2, etc. There is a potential deadlock because one process can do lock A, lock B, and another can do lock B, lock A; this can be prevented by having the ioctl that returns the numbers return them for multiple buffers at once. This solution needs no changes to userspace outside of Mesa drivers, and we'll also keep the BO wait ioctl for GPU-CPU sync.
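Purely as an illustration of what that uapi could look like (everything below is invented, not an existing amdgpu interface): the kernel hands out wait/signal points for several BOs in one call so the lock-ordering problem above can't occur.

#include <linux/types.h>

struct drm_bo_timeline_point {
	__u32 bo_handle;
	__u32 pad;
	__u64 wait_point;	/* wait for this value before touching the BO */
	__u64 signal_point;	/* write this value when done (wait_point + 1) */
};

struct drm_query_bo_timeline {
	__u64 points_ptr;	/* user pointer to an array of drm_bo_timeline_point */
	__u32 num_points;	/* in: array length; kernel fills all entries atomically */
	__u32 pad;
};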
Marek
On Tue, 27 Apr 2021 at 22:06, Christian König ckoenig.leichtzumerken@gmail.com wrote:
Correct, we wouldn't have synchronization between device with and without user queues any more.
That could only be a problem for A+I Laptops.
Since I think you mentioned you'd only be enabling this on newer chipsets, won't it be a problem for A+A where one A is a generation behind the other?
I'm not really liking where this is going btw; it seems like an ill-thought-out concept. If AMD is really going down the road of designing hw that is currently Linux-incompatible, you are going to have to accept a big part of the burden of bringing this support to more than just amd drivers for upcoming generations of gpu.
Dave.
Supporting interop with any device is always possible. It depends on which drivers we need to interoperate with and update them. We've already found the path forward for amdgpu. We just need to find out how many other drivers need to be updated and evaluate the cost/benefit aspect.
Marek
On Tue, Apr 27, 2021 at 2:38 PM Dave Airlie airlied@gmail.com wrote:
Hi Dave,
Am 27.04.21 um 21:23 schrieb Marek Olšák:
Crap, that is a good point as well.
Well we don't really like that either, but we have no other option as far as I can see.
I have a couple of ideas for how to handle this in the kernel without dma_fences, but they always require more or less changes to all existing drivers.
Christian.
On 2021-04-28 8:59 a.m., Christian König wrote:
I don't really understand what "future hw may remove support for kernel queues" means exactly. While the per-context queues can be mapped to userspace directly, they don't *have* to be, do they? I.e. the kernel driver should be able to either intercept userspace access to the queues, or in the worst case do it all itself, and provide the existing synchronization semantics as needed?
Surely there are resource limits for the per-context queues, so the kernel driver needs to do some kind of virtualization / multiplexing anyway, or we'll get sad user faces when there's no queue available for <current hot game>.
I'm probably missing something though, awaiting enlightenment. :)
On Wed, Apr 28, 2021 at 11:07:09AM +0200, Michel Dänzer wrote:
Yeah in all this discussion what's unclear to me is: is this a hard amdgpu requirement going forward, in which case you need a time machine and lots of people to retroactively fix this, because this ain't fast to get fixed.
Or is this just musings for an ecosystem that better fits current&future hw, for which I think we all agree where the rough direction is?
The former is quite a glorious situation, and I'm with Dave here that if your hw engineers really removed the bit to not map the ringbuffers to userspace, then amd gets to eat a big chunk of the cost here. -Daniel
On Wed, Apr 28, 2021 at 5:07 AM Michel Dänzer michel@daenzer.net wrote:
The hw interface for userspace is that the ring buffer is mapped to the process address space alongside a doorbell aperture (4K page) that isn't real memory, but when the CPU writes into it, it tells the hw scheduler that there are new GPU commands in the ring buffer. Userspace inserts all the wait, draw, and signal commands into the ring buffer and then "rings" the doorbell. It's my understanding that the ring buffer and the doorbell are always mapped in the same GPU address space as the process, which makes it very difficult to emulate the current protected ring buffers in the kernel. The VMID of the ring buffer is also not changeable.
The hw scheduler doesn't do any synchronization and it doesn't see any dependencies. It only chooses which queue to execute, so it's really just a simple queue manager handling the virtualization aspect and not much else.
Marek
On Sat, May 1, 2021 at 6:27 PM Marek Olšák maraeo@gmail.com wrote:
The doorbell does not have to be mapped into the process's GPU virtual address space. The CPU could write to it directly. Mapping it into the GPU's virtual address space would, however, allow you to have a device rather than the CPU kick off work. E.g., the GPU could kick off its own work, or multiple devices could kick off work without CPU involvement.
Alex
Sorry for the top-post but there's no good thing to reply to here...
One of the things pointed out to me recently by Daniel Vetter that I didn't fully understand before is that dma_buf has a very subtle second requirement beyond finite time completion: Nothing required for signaling a dma-fence can allocate memory. Why? Because the act of allocating memory may wait on your dma-fence. This, as it turns out, is a massively more strict requirement than finite time completion and, I think, throws out all of the proposals we have so far.
Take, for instance, Marek's proposal for userspace involvement with dma-fence by asking the kernel for a next serial and the kernel trusting userspace to signal it. That doesn't work at all if allocating memory to trigger a dma-fence can blow up. There's simply no way for the kernel to trust userspace to not do ANYTHING which might allocate memory. I don't even think there's a way userspace can trust itself there. It also blows up my plan of moving the fences to transition boundaries.
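(For reference, this constraint is what the dma-fence signalling lockdep annotations are meant to catch; a sketch of how a driver's signalling path gets annotated, assuming the critical section really contains everything needed to signal:)

#include <linux/dma-fence.h>

static void driver_signal_fence(struct dma_fence *fence)
{
	bool cookie = dma_fence_begin_signalling();

	/* Nothing in here may allocate memory or take locks that are also
	 * held around allocations, because reclaim can wait on fences;
	 * lockdep is supposed to flag such dependencies. */
	dma_fence_signal(fence);

	dma_fence_end_signalling(cookie);
}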
Not sure where that leaves us.
--Jason
On Mon, May 3, 2021 at 9:42 AM Alex Deucher alexdeucher@gmail.com wrote:
Am 03.05.21 um 16:59 schrieb Jason Ekstrand:
Well at least I was perfectly aware of that :)
I'm currently experimenting with some sample code which would allow implicit sync with user fences.
Not that I'm pushing hard into that directly, but I just want to make clear how simple or complex the whole thing would be.
Christian.
On Mon, May 3, 2021 at 10:03 AM Christian König ckoenig.leichtzumerken@gmail.com wrote:
I'd have been a bit disappointed if this had been news to you. :-P However, there are a number of us plebeians on the thread who need things spelled out sometimes. :-)
I'd like to see that. It'd be good to know what our options are. Honestly, if we can get implicit sync somehow without tying our hands w.r.t. how fences work in modern drivers, that opens a lot of doors.
--Jason
On Mon, May 3, 2021 at 5:00 PM Jason Ekstrand jason@jlekstrand.net wrote:
Honestly, the more I look at things, the more I think userspace-signalable fences with a timeout sound like a valid solution for these issues. Especially since (as has been mentioned countless times in this email thread) userspace already has a lot of ways to cause timeouts and/or GPU hangs through GPU work.
Adding a timeout on the signaling side of a dma_fence would ensure:
- The dma_fence signals in finite time.
- If the timeout case does not allocate memory, then memory allocation is not a blocker for signaling.
Of course you lose the full dependency graph and we need to make sure garbage collection of fences works correctly when we have cycles. However, the latter sounds very doable and the first sounds like it is to some extent inevitable.
I feel like I'm missing some requirement here given that we immediately went to much more complicated things but can't find it. Thoughts?
- Bas
On Mon, May 3, 2021 at 10:16 AM Bas Nieuwenhuizen bas@basnieuwenhuizen.nl wrote:
Timeouts are sufficient to protect the kernel but they make the fences unpredictable and unreliable from a userspace PoV. One of the big problems we face is that, once we expose a dma_fence to userspace, we've allowed for some pretty crazy potential dependencies that neither userspace nor the kernel can sort out. Say you have Marek's "next serial, please" proposal and a multi-threaded application. Between the time you ask the kernel for a serial and get a dma_fence and the time you submit the work to signal that serial, your process may get preempted, something else gets shoved in which allocates memory, and then we end up blocking on that dma_fence. There's no way userspace can predict and defend itself from that.
So I think where that leaves us is that there is no safe place to create a dma_fence except for inside the ioctl which submits the work and only after any necessary memory has been allocated. That's a pretty stiff requirement. We may still be able to interact with userspace a bit more explicitly but I think it throws any notion of userspace direct submit out the window.
--Jason
What about direct submit from the kernel where the process still has write access to the GPU ring buffer but doesn't use it? I think that solves your preemption example, but leaves a potential backdoor for a process to overwrite the signal commands, which shouldn't be a problem since we are OK with timeouts.
Marek
On Mon, May 3, 2021 at 11:23 AM Jason Ekstrand jason@jlekstrand.net wrote:
Proposal for a new CS ioctl, kernel pseudo code:
lock(&global_lock);
serial = get_next_serial(dev);        /* allocate the next fence serial */
add_wait_command(ring, serial - 1);   /* wait for the previous submission */
add_exec_cmdbuf(ring, user_cmdbuf);   /* execute the user's command buffer */
add_signal_command(ring, serial);     /* signal this submission's serial */
*ring->doorbell = FIRE;
unlock(&global_lock);
See? Just like userspace submit, but in the kernel without concurrency/preemption. Is this now safe enough for dma_fence?
Marek
On Mon, May 3, 2021 at 4:36 PM Marek Olšák maraeo@gmail.com wrote:
Unfortunately, as I pointed out to Daniel as well, this won't work 100% reliably either.
See, the signal on the ring buffer needs to be protected from manipulation by userspace so that we can guarantee that the hardware really has finished executing when it fires.
Protecting memory by immediate page table updates is a good first step, but unfortunately not sufficient (and we would need to restructure large parts of the driver to make this happen).
On older hardware we often had the situation that for reliable invalidation we need the guarantee that every previous operation has finished executing. It's not so much of a problem when the next operation has already started, since then we had the opportunity to do things in between the last and the next operation. Just see cache invalidation and VM switching for example.
Additionally, it doesn't really buy us anything; there is not much advantage to this. Writing the ring buffer in userspace and then ringing the doorbell in the kernel has the same overhead as doing everything in the kernel in the first place.
Christian.
Am 04.05.21 um 05:11 schrieb Marek Olšák:
On Tue, May 04, 2021 at 09:01:23AM +0200, Christian König wrote:
Unfortunately, as I pointed out to Daniel as well, this won't work 100% reliably either.
You're claiming this, but there's no clear reason why really, and you didn't reply to my last mail on that sub-thread, so I really don't get where exactly you're seeing a problem.
Nope you don't. Userspace is already allowed to submit all kinds of random garbage, the only thing the kernel has to guarantee is:
- the dma-fence DAG stays a DAG
- dma-fences complete in finite time
Everything else is not the kernel's problem, and if userspace mixes stuff up like manipulates the seqno, that's ok. It can do that kind of garbage already.
This is why you need the unload-fence on top, because indeed you can't just rely on the fences created from the userspace ring, those are unreliable for memory management.
btw I thought some more, and I think it's probably best if we only attach the unload-fence in the ->move(_notify) callbacks. Kinda like we already do for async copy jobs. So the overall buffer move sequence would be:
1. wait for the (untrusted for the kernel, but necessary for userspace correctness) fake dma-fences that rely on the userspace ring
2. unload ctx
3. copy buffer
Ofc 2&3 would be done async behind a dma_fence.
If you have gpu page faults you generally have synchronous tlb invalidation, so this also shouldn't be a big problem. Combined with the unload fence at least. If you don't have synchronous tlb invalidate it gets a bit more nasty and you need to force a preemption to a kernel context which has the required flushes across all the caches. Slightly nasty, but the exact same thing would be required for handling page faults anyway with the direct userspace submit model.
Again I'm not seeing a problem.
It gets you dma-fence backwards compat without having to rewrite the entire userspace ecosystem. Also since you have the hw already designed for ringbuffer in userspace it would be silly to copy that through the cs ioctl, that's just overhead.
Also I thought the problem you're having is that all the kernel ringbuf stuff is going away, so the old cs ioctl won't work anymore for sure?
Maybe also pick up that other subthread which ended with my last reply.
Cheers, Daniel
Am 04.05.21 um 09:32 schrieb Daniel Vetter:
Yeah, it's rather hard to explain without pointing out how the hardware works in detail.
And exactly that's the problem! We can't provide a reliable unload-fence and the user fences are unreliable for that.
I talked this through at length with our hardware/firmware guy last Thursday but couldn't find a solution either.
We can have a preemption fence for the kernel which says: Hey, this queue was scheduled away; you can touch its hardware descriptor, control registers, page tables, TLB, memory, GWS, GDS, OA etc... again. But that one is only triggered on preemption and then we have the same ordering problems once more.
Or we can have an end-of-operation fence for userspace which says: Hey, this queue has finished its batch of execution. But this one is manipulable from userspace into either finishing too early (very, very bad for invalidations and memory management) or finishing too late/never (deadlock-prone, but fixable by a timeout).
What we could do is to use the preemption fence to emulate the unload fence, e.g. something like:
1. Preempt the queue in fixed intervals (let's say 100ms).
2. While preempted, check if we have reached the checkpoint in question by looking at the hardware descriptor.
3. If we have reached the checkpoint, signal the unload fence.
4. If we haven't reached the checkpoint, resume the queue again.
The problem is that this might introduce a maximum of 100ms delay before signaling the unload fence and preempt/resume has such a hefty overhead that we waste a horrible amount of time on it.
Please tell that to our hardware engineers :)
We have two modes of operation, see the whole XNACK on/off discussion on the amdgfx mailing list.
We still have a bit more time for this. As I learned from our firmware engineer last Thursday the Windows side is running into similar problems as we do.
Maybe also pick up that other subthread which ended with my last reply.
I will send out another proposal for how to handle user fences shortly.
Cheers, Christian.
On Tue, May 4, 2021 at 10:09 AM Christian König ckoenig.leichtzumerken@gmail.com wrote:
So your hw can preempt? That's good enough.
The unload fence is just:
1. wait for all dma_fence that are based on the userspace ring. This is unreliable, but we don't care because tdr will make it reliable. And once tdr shot down a context we'll force-unload and thrash it completely, which solves the problem.
2. preempt the context, which /should/ now be stuck waiting for more commands to be stuffed into the ringbuffer. Which means your preemption is hopefully fast enough to not matter. If your hw takes forever to preempt an idle ring, I can't help you :-)
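As a rough sketch under those assumptions (the ctx/helper names below are made up; only the dma_fence calls are the real kernel API), the unload path could look roughly like this:

#include <linux/dma-fence.h>

struct user_ctx {
	long tdr_timeout;	/* jiffies */
	/* ... */
};

/* Hypothetical driver helpers, named for illustration only. */
struct dma_fence *collect_ring_fences(struct user_ctx *ctx);
void force_unload_and_ban(struct user_ctx *ctx);
int hw_preempt_context(struct user_ctx *ctx);

static int ctx_unload(struct user_ctx *ctx)
{
	struct dma_fence *f;
	long ret;

	/* 1. Wait for the (untrusted) fences derived from the userspace
	 *    ring. If userspace lies, TDR shoots the context down and we
	 *    proceed anyway. */
	while ((f = collect_ring_fences(ctx))) {
		ret = dma_fence_wait_timeout(f, false, ctx->tdr_timeout);
		dma_fence_put(f);
		if (ret <= 0)
			force_unload_and_ban(ctx);
	}

	/* 2. Preempt what should by now be an idle ring; this is expected
	 *    to be fast. */
	return hw_preempt_context(ctx);
}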
Also, if userspace lies to us and keeps pushing crap into the ring after it's supposed to be idle: Userspace is already allowed to waste gpu time. If you're too worried about this set a fairly aggressive preempt timeout on the unload fence, and kill the context if it takes longer than what preempting an idle ring should take (because that would indicate broken/evil userspace).
Again, I'm not seeing the problem. Except if your hw is really completely busted to the point where it can't even support userspace ringbuffers properly and with sufficient performance :-P
Of course if you issue the preempt context request before the userspace fences have finished (or tdr cleaned up the mess) like you do in your proposal, then it will be ridiculously expensive and/or won't work. So just don't do that.
I didn't find this anywhere with a quick search. Pointers to archive (lore.kernel.org/amd-gfx is the best imo).
This story sounds familiar, I've heard it a few times here at intel too on various things where we complained and then windows hit the same issues too :-)
E.g. I've just learned that all the things we've discussed around gpu page faults vs 3d workloads and how you need to reserve some CU for 3d guaranteed forward progress or even worse measures is also something they're hitting on Windows. Apparently they fixed it by only running 3d or compute workloads at the same time, but not both.
Maybe also pick up that other subthread which ended with my last reply.
I will send out another proposal for how to handle user fences shortly.
Maybe let's discuss this here first before we commit to requiring all userspace to upgrade to user fences ... I do agree that we want to go there too, but breaking all the compositors is probably not the best option.
Cheers, Daniel
Am 04.05.21 um 10:27 schrieb Daniel Vetter:
Yeah, it just takes too long for the preemption to complete to be really useful for the feature we are discussing here.
As I said, when the kernel requests to preempt a queue, we can easily expect a delay of ~100ms until that comes back. For compute that is even in the multiple-seconds range.
The "preemption" feature is really called suspend and made just for the case when we want to put a process to sleep or need to forcefully kill it for misbehavior or stuff like that. It is not meant to be used in normal operation.
If we only attach it on ->move then yeah maybe a last resort possibility to do it this way, but I think in that case we could rather stick with kernel submissions.
I think you have the wrong expectation here. It is perfectly valid and expected for userspace to keep writing commands into the ring buffer.
After all when one frame is completed they want to immediately start rendering the next one.
Can't find that offhand either, but see the amdgpu_noretry module option.
It basically tells the hardware whether retry page faults should be supported or not, because this whole TLB shootdown thing when they are supported is extremely costly.
I'm not even sure if we are going to see user fences on Windows with the next hw generation.
Before we can continue with this discussion we need to figure out how to get the hardware reliable first.
In other words if we would have explicit user fences everywhere, how would we handle timeouts and misbehaving processes? As it turned out they haven't figured this out on Windows yet either.
I was more thinking about handling it all in the kernel.
Christian.
On Tue, May 04, 2021 at 11:14:06AM +0200, Christian König wrote:
100ms for preempting an idle request sounds like broken hw to me. Of course preempting something that actually runs takes a while, that's nothing new. But it's also not the thing we're talking about here. Are these 100ms actual numbers from hw for an actual idle ringbuffer?
Well this is a hybrid userspace ring + kernel augmented submit mode, so you can keep dma-fences working. Because the dma-fence stuff won't work with pure userspace submit, I think that conclusion is rather solid. Once more, even after this long thread here.
Sure, for the true userspace direct submit model. But with that you don't get dma-fence, which means this gpu will not work for 3d accel on any current linux desktop.
Which sucks, hence some hybrid model of using the userspace ring and kernel augmented submit is needed. Which was my idea.
Hm so synchronous tlb shootdown is a lot more costly when you allow retrying of page faults?
That sounds bad, because for full hmm mode you need to be able to retry pagefaults. Well at least the PASID/ATS/IOMMU side will do that, and might just hang your gpu for a long time while it's waiting for the va->pa lookup response to return. So retrying lookups shouldn't be any different really.
And you also need fairly fast synchronous tlb shootdown for hmm. So if your hw has a problem with both together that sounds bad.
Lol.
Yeah can do, just means that you also have to copy the ringbuffer stuff over from userspace to the kernel.
It also means that there's more differences in how your userspace works between full userspace mode (necessary for compute) and legacy dma-fence mode (necessary for desktop 3d). Which is especially big fun for vulkan, since that will have to do both.
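(Sketch of what such a kernel-augmented submit could look like: the kernel copies the tiny IB descriptors out of the userspace ring into its own privileged ring and emits a dma_fence at the end. The drv_* names and the drv_ib layout are invented for illustration; the copy_from_user/dma_fence calls are real kernel API.)

#include <linux/dma-fence.h>
#include <linux/err.h>
#include <linux/uaccess.h>

struct drv_ib {
	u64 gpu_addr;	/* GPU VA of the userspace-built command buffer */
	u32 length_dw;	/* size in dwords */
	u32 flags;
};

struct drv_kernel_ring;	/* kernel-owned, privileged ring */

/* hypothetical driver hooks */
int drv_ring_emit_ib(struct drv_kernel_ring *ring, const struct drv_ib *ib);
struct dma_fence *drv_ring_emit_fence(struct drv_kernel_ring *ring);

static struct dma_fence *
drv_augmented_submit(struct drv_kernel_ring *ring,
		     const struct drv_ib __user *ibs, unsigned int count)
{
	struct drv_ib ib;
	unsigned int i;
	int r;

	/* each entry is only an address plus a length, so the copy is cheap */
	for (i = 0; i < count; i++) {
		if (copy_from_user(&ib, &ibs[i], sizeof(ib)))
			return ERR_PTR(-EFAULT);
		r = drv_ring_emit_ib(ring, &ib);
		if (r)
			return ERR_PTR(r);
	}

	/* the returned dma_fence is what dma_resv/sync_file/drm_syncobj see */
	return drv_ring_emit_fence(ring);
}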
But then amd is still hanging onto the amdgpu vs amdkfd split, so you're going for max pain in this area anyway :-P -Daniel
Am 04.05.21 um 11:47 schrieb Daniel Vetter:
Well 100ms is just an example of the scheduler granularity. Let me explain in a wider context.
The hardware can have X queues mapped at the same time, and every Y time interval the hardware scheduler checks whether those queues have changed; only if they have changed are the necessary steps to reload them started.
Multiple queues can be rendering at the same time, so you can have e.g. a high priority queue active and just waiting for a signal to start, the client rendering one frame after another, and a third background compute task mining bitcoins for you.
As long as everything is static this is perfectly performant. Adding a queue to the list of active queues is also relatively simple, but taking one down requires you to wait until we are sure the hardware has seen the change and reloaded the queues.
Think of it as an RCU grace period. This is simply not something which is made to be used constantly, but rather just at process termination.
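(A toy illustration of that grace-period-style unmap, with invented hw_sched_* hooks; the real mechanism lives in the firmware scheduler, so this is only meant to show why teardown blocks.)

#include <linux/completion.h>
#include <linux/errno.h>
#include <linux/jiffies.h>

struct hw_sched {
	struct completion runlist_reloaded;	/* assumed: fw signals after re-scanning the runlist */
};

struct hw_queue;

/* hypothetical hooks into the firmware scheduler */
void hw_sched_remove_from_runlist(struct hw_sched *s, struct hw_queue *q);
void hw_sched_kick_rescan(struct hw_sched *s);

static int hw_queue_unmap(struct hw_sched *s, struct hw_queue *q)
{
	reinit_completion(&s->runlist_reloaded);
	hw_sched_remove_from_runlist(s, q);
	hw_sched_kick_rescan(s);

	/* with ~100ms scheduler granularity this can block for a while,
	 * which is fine for process teardown but not for hot paths */
	if (!wait_for_completion_timeout(&s->runlist_reloaded,
					 msecs_to_jiffies(500)))
		return -ETIMEDOUT;
	return 0;
}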
When assisted with unload fences, then yes. Problem is that I can't see how we could implement those in a performant way currently.
I'm not sure of that. I've looked a bit into how we could add user fences to dma_resv objects and that isn't that hard after all.
Which sucks, hence some hybrid model of using the userspace ring and kernel augmented submit is needed. Which was my idea.
Yeah, I think when our firmware folks would really remove the kernel queue and we still don't have
Partially correct, yes.
See, when you have retry page faults enabled and unmap something, you need to make sure that everybody who could have potentially translated that page and has a TLB entry either gets invalidated or is waited on until the access has completed.
Since every CU could be using a memory location, that takes ages to complete compared to the normal invalidation where you just invalidate the L1/L2 and are done.
In addition to that, the recovery adds some extra overhead to every memory access, so even without a fault you are quite a bit slower if this is enabled.
Completely agree. And since it was my job to validate the implementation on Vega10 I was also the first one to realize that.
Felix, a couple of others and I have been trying to work around those restrictions ever since.
That is my least worry. The IBs are just addr+length, so no more than 16 bytes for each IB.
That is the bigger problem.
Christian.
On Tue, May 4, 2021 at 12:53 PM Christian König ckoenig.leichtzumerken@gmail.com wrote:
Uh ... that indeed sounds rather broken.
Otoh it's just a dma_fence that we'd inject as this unload-fence. So by and large everyone should already be able to cope with it taking a bit longer. So from a design pov I don't see a huge problem, but I guess you guys won't be happy since it means on amd hw there will be random unsightly stalls in desktop linux usage.
Is there really no way to fix fw here? Like if process start/teardown takes 100ms, that's going to suck no matter what.
I think as a proof of concept it's fine, but as an actual solution ... pls no. Two reasons: - implicit sync is bad - this doesn't fix anything for explicit sync using dma_fence in terms of sync_file or drm_syncobj.
So if we go with the route of papering over this in the kernel, then it'll be a ton more work than just hacking something into dma_resv.
Yeah I think kernel queue can be removed. But the price is that you need reasonable fast preempt of idle contexts.
I really can't understand how this can take multiple ms, something feels very broken in the design of the fw (since obviously the hw can preempt an idle context to another one pretty fast, or you'd render any multi-client desktop as a slideshow at best).
Well yes it's complicated, and it's even more fun when the tlb invalidate comes in through the IOMMU through ATS.
But also if you don't your hw is just broken from a security pov, no page fault handling for you. So it's really not optional.
Ah ok, maybe I'm biased from drm/i915 where an ib launch + seqno is rather long, because the hw folks keep piling more workarounds and additional flushes on top. Like on some hw the recommended w/a was to just issue 32 gpu cache flushes or something like that (otherwise the seqno write could arrive before the gpu actually finished flushing) :-/
Cheers, Daniel
Am 04.05.21 um 13:13 schrieb Daniel Vetter:
Well I wouldn't call it broken. It's just not made for the use case we are trying to abuse it for.
Otoh it's just a dma_fence that we'd inject as this unload-fence.
Yeah, exactly that's why it isn't much of a problem for process termination or freeing memory.
As I said adding the queue is unproblematic and teardown just results in a bit more waiting to free things up.
More problematic are overcommit, swapping and OOM situations, which need to wait for the hw scheduler to come back and tell us that the queue is now unmapped.
Well can't disagree with that :) But I think we can't avoid supporting it.
Exactly.
Whether we do implicit sync or explicit sync is orthogonal to the problem that sync must be made reliable somehow.
So when we sync and time out, the waiter should just continue, but whoever failed to signal will be punished.
But since this isn't solved on Windows I don't see how we can solve it on Linux either.
So if we go with the route of papering over this in the kernel, then it'll be a ton more work than just hacking something into dma_resv.
I'm just now prototyping that and at least for the driver parts it doesn't look that hard after all.
Well the hardware doesn't preempt an idle context. See, you can have a number of active ("mapped" in the fw terminology) contexts, and contexts are usually kept active even when they are idle.
So when multi-client desktop switches between context then that is rather fast, but when the kernel asks for a context to be unmapped that can take rather long.
Yeah, but that is also a known issue. You either have retry faults and live with the extra overhead or you disable them and go with the kernel based submission approach.
Well I once had a conversation with a hw engineer who wanted to split up the TLB invalidations into 1GiB chunks :)
That would have meant we would need to emit 2^17 different invalidation requests on the kernel ring buffer....
Christian.
On Tue, May 04, 2021 at 02:48:35PM +0200, Christian König wrote:
Ok so your hw really hates the unload fence. On ours the various queues are a bit more explicit, so largely unload/preempt is the same as context switch and pretty quick. Afaik at least.
Still baffled that you can't fix this in fw, but oh well. Judging from how fast our fw team moves I'm not surprised :-/
Anyway so next plan: make this work exactly like hmm:
1. wait for the user fence as a dma-fence fake thing, tdr makes this safe
2. remove the pte
3. do a synchronous tlb flush
Tada, no more 100ms stall in your buffer move callbacks. And feel free to pack up 2&3 into an async worker or something if it takes too long and treating it as a bo move dma_fence is better. Also that way you might be able to batch up the tlb flushing if it's too damn expensive, by collecting them all under a single dma_fence (and starting a new tlb flush cycle every time ->enable_signalling gets called).
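(A sketch of that eviction flow with hypothetical drv_* helpers; only the dma_fence calls are real kernel API, and the tdr-backed user-fence wrapper is assumed to exist in the driver.)

#include <linux/dma-fence.h>
#include <linux/sched.h>	/* MAX_SCHEDULE_TIMEOUT */

struct drv_bo;
struct drv_vm;

/* hypothetical: a dma_fence that signals when the user fence value is
 * reached, or when tdr kills the offending context */
struct dma_fence *drv_user_fence_as_dma_fence(struct drv_bo *bo);
/* hypothetical page-table / TLB helpers */
void drv_vm_remove_ptes(struct drv_vm *vm, struct drv_bo *bo);
int drv_vm_flush_tlb_sync(struct drv_vm *vm);

static int drv_evict_bo(struct drv_vm *vm, struct drv_bo *bo)
{
	struct dma_fence *f;
	long ret;

	/* 1. wait for the user fence, made safe by tdr */
	f = drv_user_fence_as_dma_fence(bo);
	if (f) {
		ret = dma_fence_wait_timeout(f, false, MAX_SCHEDULE_TIMEOUT);
		dma_fence_put(f);
		if (ret < 0)
			return ret;
	}

	/* 2. remove the PTEs so the GPU faults instead of reading stale data */
	drv_vm_remove_ptes(vm, bo);

	/* 3. synchronous TLB flush; steps 2+3 could also be pushed into an
	 * async worker exposed as the bo-move dma_fence, batching flushes
	 * across evictions if a single flush is too expensive */
	return drv_vm_flush_tlb_sync(vm);
}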
As long as you nack any gpu faults and don't try to fill them for these legacy contexts that support dma-fence there's no harm in using the hw facilities.
Ofc if you're now telling me your synchronous tlb flush is also 100ms, then maybe just throw the hw out the window, and accept that the millisecond anything evicts anything (good luck with userptr) the screen freezes for a bit.
Well kernel based submit is out with your new hw it sounds, so retry faults and sync tlb invalidate is the price you have to pay. There's no "both ways pls" here :-)
Well on the cpu side you invalidate tlbs as ranges, but there's a fallback to just flush the entire thing if the range flush is too much. So it's not entirely bonkers, just that the global flush needs to be there still. -Daniel
I see some mentions of XNACK and recoverable page faults. Note that all gaming AMD hw that has userspace queues doesn't have XNACK, so there is no overhead in compute units. My understanding is that recoverable page faults are still supported without XNACK, but instead of the compute unit replaying the faulting instruction, the L1 cache does that. Anyway, the point is that XNACK is totally irrelevant here.
Marek
On Tue., May 4, 2021, 08:48 Christian König, < ckoenig.leichtzumerken@gmail.com> wrote:
On Tue, May 4, 2021 at 12:16 PM Marek Olšák maraeo@gmail.com wrote:
I'm looking forward to seeing the prototype because...
Regardless of implicit vs. explicit sync, the fundamental problem we have to solve is the same. I'm moderately hopeful that if Christian has an idea for how to do it with dma_resv that maybe we can translate that in a semi-generic way to syncobj. Yes, I realize I just waved my hands and made all the big problems go away. Except I really didn't. I made them all Christian's problems. :-P
--Jason
On Wed, Apr 28, 2021 at 08:59:47AM +0200, Christian König wrote:
Yeah one horrible idea is to essentially do the plan we hashed out for adding userspace fences to drm_syncobj timelines. And then add drm_syncobj as another implicit fencing thing to dma-buf.
But: - This is horrible. We're all agreeing that implicit sync is not a great idea, building an entire new world on this flawed thing doesn't sound like a good path forward.
- It's kernel uapi, so it's going to be forever.
- It's only fixing the correctness issue, since you have to stall for future/indefinite fences at the beginning of the CS ioctl. Or at the beginning of the atomic modeset ioctl, which kinda defeats the point of nonblocking.
- You still have to touch all kmd drivers.
- For performance, you still have to glue a submit thread onto all gl drivers.
It is horrendous. -Daniel
On Tue, Apr 27, 2021 at 1:38 PM Dave Airlie airlied@gmail.com wrote:
In case my previous e-mail sounded too enthusiastic, I'm also pensive about this direction. I'm not sure I'm ready to totally give up on all of Linux WSI just yet. We definitely want to head towards memory fences and direct submission but I'm not convinced that throwing out all of interop is necessary. It's certainly a very big hammer and we should try to figure out something less destructive, if that's possible. (I don't know for sure that it is.)
--Jason
On Tue, Apr 27, 2021 at 1:49 PM Marek Olšák maraeo@gmail.com wrote:
If we don't use future fences for DMA fences at all, e.g. we don't use them for memory management, it can work, right? Memory management can suspend user queues anytime. It doesn't need to use DMA fences. There might be something that I'm missing here.
Other drivers use dma_fence for their memory management. So unless you've converted them all over to the dma_fence/memory fence split, dma_fence fences stay memory fences. In theory this is possible, but maybe not if you want to complete the job this decade :-)
What would we lose without DMA fences? Just inter-device synchronization? I think that might be acceptable.
The only case when the kernel will wait on a future fence is before a page flip. Everything today already depends on userspace not hanging the gpu, which makes everything a future fence.
That's not quite what we defined as future fences, because tdr guarantees those complete, even if userspace hangs. It's when you put userspace fence waits into the cs buffer you've submitted to the kernel (or directly to hw) where the "real" future fences kick in. -Daniel
On Tue, Apr 20, 2021 at 1:54 PM Daniel Vetter daniel@ffwll.ch wrote:
Let me elaborate on this a bit. One of the problems I mentioned earlier is the conflation of fence types inside the kernel. dma_fence is used for solving two different semi-related but different problems: client command synchronization and memory residency synchronization. In the old implicit GL world, we conflated these two and thought we were providing ourselves a service. Not so much....
It's all well and good to say that we should turn the memory fence into a dma_fence and throw a timeout on it. However, these window-system sync primitives, as you said, have to be able to be shared across everything. In particular, we have to be able to share them with drivers that don't make a good separation between command and memory synchronization.
Let's say we're rendering on ANV with memory fences and presenting on some USB display adapter whose kernel driver is a bit old-school. When we pass that fence to the other driver via a sync_file or similar, that driver may shove that dma_fence into the dma_resv on some buffer somewhere. Then our client, completely unaware of internal kernel dependencies, binds that buffer into its address space and kicks off another command buffer. So i915 throws in a dependency on that dma_resv which contains the previously created dma_fence and refuses to execute any more command buffers until it signals. Unfortunately, unbeknownst to i915, that command buffer which the client kicked off after doing that bind was required for signaling the memory fence on which our first dma_fence depends. Deadlock.
Sure, we put a timeout on the dma_fence and it will eventually fire and unblock everything. However, there's one very important point that's easy to miss here: Neither i915 nor the client did anything wrong in the above scenario. The Vulkan footgun approach works because there are a set of rules and, if you follow those rules, you're guaranteed everything works. In the above scenario, however, the client followed all of the rules and got a deadlock anyway. We can't have that.
Yeah, it may be that this approach can be made to work. Instead of reusing dma_fence, maybe we can reuse syncobj and have another form of syncobj which is a memory fence, a value to wait on, and a timeout.
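(Purely hypothetical shape of such a memory-fence syncobj payload, just to make the idea concrete; this is not actual uapi and every field name is invented.)

#include <linux/types.h>

struct drm_memory_fence {
	__u64 gpu_addr;		/* GPU/CPU-visible location being written */
	__u64 wait_value;	/* considered signaled once *gpu_addr >= wait_value */
	__u64 timeout_ns;	/* upper bound before the kernel steps in */
	__u32 ctx_handle;	/* context to blame/penalise on timeout */
	__u32 pad;
};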
--Jason
On Tue, Apr 20, 2021 at 9:17 PM Jason Ekstrand jason@jlekstrand.net wrote:
Nope. Because the waiting for this future fence will only happen in two places:
- the driver submit thread, which is just userspace without holding anything. From the kernel pov this can be preempted, memory temporarily taken away, all these things. Until that's done you will _not_ get a real dma_fence, but just another future fence.
- but what about the usb display, you're asking? Well for that we'll need a new atomic extension, which takes a timeline syncobj and gives you back a timeline syncobj. And the rules are that if one of them is a future fence/userspace fence, so will the other be (even if it's created by the kernel).
Either way you get a timeline syncobj back which anv can then again handle properly with its submit thread. Not a dma_fence with a funny timeout, because there's deadlock issues with those.
So no, you won't be able to get a dma_fence out of your sleight of hand here.
It's going to be the same container. But very much not a dma_fence.
Note the other approach is if you split the kernel's notion of what a dma_fence is into two parts: memory fence and synchronization primitive. The trouble is that there's tons of hw for which these are by necessity the same things (because they can't preempt or don't have a scheduler), so the value of this for the overall ecosystem is slim. And the work to make it happen (plumb future fences through the drm/scheduler and everything) is gigantic. drm/i915-gem tried, the result is not pretty and we're now backing it largely all out, not least because it's not where hw/vulkan/compute are actually going I think.
So that's an approach which I think does exist in theory, but really not something I think we should attempt. -Daniel
Hi,
On Mon, 19 Apr 2021 at 11:48, Marek Olšák maraeo@gmail.com wrote:
Another thought: with completely arbitrary userspace fencing, none of this is helpful either. If the compositor can't guarantee that a hostile client hasn't submitted a fence which will never be signaled, then it won't be blindly waiting on it, so it already needs infrastructure to handle something like this. That already handles the crashed-client case, because if the client crashes, then its connection will be dropped, which will trigger the compositor to destroy all its resources anyway, including any pending waits.
GPU hangs also look pretty similar; it's an infinite wait, until the client resubmits a new buffer which would replace (& discard) the old.
So signal-fence-on-process-exit isn't helpful and doesn't provide any extra reliability; it in fact probably just complicates things.
Cheers, Daniel
Am 20.04.21 um 16:53 schrieb Daniel Stone:
Exactly that's the problem. A compositor isn't immediately informed that the client crashed, instead it is still referencing the buffer and trying to use it for compositing.
GPU hangs also look pretty similar; it's an infinite wait, until the client resubmits a new buffer which would replace (& discard) the old.
Correct. You just need to assume that all queues get destroyed and re-initialized when a GPU reset happens.
So signal-fence-on-process-exit isn't helpful and doesn't provide any extra reliability; it in fact probably just complicates things.
Well it is when you go for partial GPU resets.
Regards, Christian.
On Tue, 20 Apr 2021 at 15:58, Christian König < ckoenig.leichtzumerken@gmail.com> wrote:
If the compositor no longer has a guarantee that the buffer will be ready for composition in a reasonable amount of time (which dma_fence gives us, and this proposal does not appear to give us), then the compositor isn't trying to use the buffer for compositing, it's waiting asynchronously on a notification that the fence has signaled before it attempts to use the buffer.
Marek's initial suggestion is that the kernel signal the fence, which would unblock composition (and presumably show garbage on screen, or at best jump back to old content).
My position is that the compositor will know the process has crashed anyway - because its socket has been closed - at which point we destroy all the client's resources including its windows and buffers regardless. Signaling the fence doesn't give us any value here, _unless_ the compositor is just blindly waiting for the fence to signal ... which it can't do because there's no guarantee the fence will ever signal.
Cheers, Daniel
Am 20.04.21 um 17:07 schrieb Daniel Stone:
Yeah, but that assumes that the compositor has changed to not blindly wait for the client to finish rendering, and as Daniel explained that is rather unrealistic.
What we need is a fallback mechanism which signals the fence after a timeout and gives a penalty to the one causing the timeout.
That gives us the same functionality we have today with the software scheduler inside the kernel.
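(A sketch of such a fallback: a delayed work item that force-signals the fence after the timeout and penalises the context that should have signalled it. All drv_* names are invented; the dma_fence and workqueue calls are real kernel API.)

#include <linux/dma-fence.h>
#include <linux/workqueue.h>

struct drv_ctx;

struct drv_user_fence {
	struct dma_fence base;
	struct delayed_work timeout_work;
	struct drv_ctx *signaller;	/* who was supposed to signal this */
};

/* hypothetical: mark the context guilty, e.g. ban future submissions */
void drv_ctx_penalise(struct drv_ctx *ctx);

static void drv_user_fence_timeout(struct work_struct *work)
{
	struct drv_user_fence *uf =
		container_of(work, struct drv_user_fence, timeout_work.work);

	if (!dma_fence_is_signaled(&uf->base)) {
		drv_ctx_penalise(uf->signaller);
		dma_fence_signal(&uf->base);	/* unblock all waiters */
	}
}

static void drv_user_fence_arm_timeout(struct drv_user_fence *uf,
					unsigned long timeout_jiffies)
{
	INIT_DELAYED_WORK(&uf->timeout_work, drv_user_fence_timeout);
	schedule_delayed_work(&uf->timeout_work, timeout_jiffies);
}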
Regards, Christian.
Cheers, Daniel
Hi,
On Tue, 20 Apr 2021 at 16:16, Christian König < ckoenig.leichtzumerken@gmail.com> wrote:
OK, if that's the case then I think I'm really missing something which isn't explained in this thread, because I don't understand what the additional complexity and API change gains us (see my first reply in this thread).
By way of example - say I have a blind-but-explicit compositor that takes a drm_syncobj along with a dmabuf with each client presentation request, but doesn't check syncobj completion, it just imports that into a VkSemaphore + VkImage and schedules work for the next frame.
Currently, that generates an execbuf ioctl for the composition (ignore KMS for now) with a sync point to wait on, and the kernel+GPU scheduling guarantees that the composition work will not begin until the client rendering work has retired. We have a further guarantee that this work will complete in reasonable time, for some value of 'reasonable'.
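(Roughly what that import path looks like in code today, assuming the client's fence arrives as a drm_syncobj handle and that VK_KHR_external_semaphore_fd is enabled; error/cleanup handling is trimmed and the surrounding WSI glue is omitted.)

#include <vulkan/vulkan.h>
#include <xf86drm.h>

static VkResult import_syncobj_as_semaphore(VkDevice dev, int drm_fd,
					    uint32_t syncobj_handle,
					    VkSemaphore *sem)
{
	PFN_vkImportSemaphoreFdKHR import_fd =
		(PFN_vkImportSemaphoreFdKHR)
		vkGetDeviceProcAddr(dev, "vkImportSemaphoreFdKHR");
	const VkSemaphoreCreateInfo sci = {
		.sType = VK_STRUCTURE_TYPE_SEMAPHORE_CREATE_INFO,
	};
	int sync_fd = -1;
	VkResult res;

	res = vkCreateSemaphore(dev, &sci, NULL, sem);
	if (res != VK_SUCCESS)
		return res;

	/* export the syncobj the client handed us alongside its dmabuf */
	if (drmSyncobjHandleToFD(drm_fd, syncobj_handle, &sync_fd))
		return VK_ERROR_INVALID_EXTERNAL_HANDLE;

	const VkImportSemaphoreFdInfoKHR info = {
		.sType = VK_STRUCTURE_TYPE_IMPORT_SEMAPHORE_FD_INFO_KHR,
		.semaphore = *sem,
		.handleType = VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_OPAQUE_FD_BIT,
		.fd = sync_fd,	/* ownership passes to the driver on success */
	};
	return import_fd(dev, &info);
}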
My understanding of this current proposal is that:
* userspace creates a 'present fence' with this new ioctl
* the fence becomes signaled when a value is written to a location in memory, which is visible through both CPU and GPU mappings of that page
* this 'present fence' is imported as a VkSemaphore (?) and the userspace Vulkan driver will somehow wait on this value either before submitting work or as a possibly-hardware-assisted GPU-side wait (?)
* the kernel's scheduler is thus eliminated from the equation, and every execbuf is submitted directly to hardware, because either userspace knows that the fence has already been signaled, or it will issue a GPU-side wait (?)
* but the kernel is still required to monitor completion of every fence itself, so it can forcibly complete, or penalise the client (?)
Lastly, let's say we stop ignoring KMS: what happens for the render-with-GPU-display-on-KMS case? Do we need to do the equivalent of glFinish() in userspace and only submit the KMS atomic request when the GPU work has fully retired?
Clarifying those points would be really helpful so this is less of a strawman. I have some further opinions, but I'm going to wait until I understand what I'm actually arguing against before I go too far. :) The last point is very salient though.
Cheers, Daniel
Daniel, imagine hardware that can only do what Windows does: future fences signalled by userspace whenever userspace wants, and no kernel queues like we have today.
The only reason why current AMD GPUs work is because they have a ring buffer per queue with pointers to userspace command buffers followed by fences. What will we do if that ring buffer is removed?
Marek
On Tue, Apr 20, 2021 at 11:50 AM Daniel Stone daniel@fooishbar.org wrote:
On Tue, Apr 20, 2021, 09:25 Marek Olšák maraeo@gmail.com wrote:
Hmm, that sounds kinda like what we're trying to do for Libre-SOC's gpu which is basically where the cpu (exactly the same cores as the gpu) runs a user-space software renderer with extra instructions to make it go fast, so the kernel only gets involved for futex-wait or for video scan-out. This causes problems when figuring out how to interact with dma-fences for interoperability...
Jacob Lifshay
On Tue, 20 Apr 2021 at 17:25, Marek Olšák maraeo@gmail.com wrote:
I can totally imagine that; memory fences are clearly a reality and we need to make them work for functionality as well as performance. Let's imagine that winsys joins that flying-car future of totally arbitrary sync, that we work only on memory fences and nothing else, and that this all happens by the time we're all vaccinated and can go cram into a room with 8000 other people at FOSDEM instead of trying to do this over email.
But the first couple of sentences of your proposal has the kernel monitoring those synchronisation points to ensure that they complete in bounded time. That already _completely_ destroys the purity of the simple picture you paint. Either there are no guarantees and userspace has to figure it out, or there are guarantees and we have to compromise that purity.
I understand how you arrived at your proposal from your perspective as an extremely skilled driver developer who has delivered gigantic performance improvements to real-world clients. As a winsys person with a very different perspective, I disagree with you on where you are drawing the boundaries, to the point that I think your initial proposal is worse than useless; doing glFinish() or the VkFence equivalent in clients would be better in most cases than the first mail.
I don't want to do glFinish (which I'm right about), and you don't want to do dma_fence (which you're right about). So let's work together to find a middle ground which we're both happy with. That middle ground does exist, and we as winsys people are happy to eat a significant amount of pain to arrive at that middle ground. Your current proposal is at once too gentle on the winsys, and far too harsh on it. I only want to move where and how those lines are drawn, not to pretend that all the world is still a single-context FIFO execution engine.
Cheers, Daniel
On Tue, Apr 20, 2021 at 6:25 PM Marek Olšák maraeo@gmail.com wrote:
Daniel, imagine hardware that can only do what Windows does: future fences signalled by userspace whenever userspace wants, and no kernel queues like we have today.
The only reason why current AMD GPUs work is because they have a ring buffer per queue with pointers to userspace command buffers followed by fences. What will we do if that ring buffer is removed?
Well this is an entirely different problem than what you set out to describe. This is essentially the problem where the hw does not have any support for privileged commands and a separate privileged command buffer, and direct userspace submit is the only thing that is available.
I think if this is your problem, then you get to implement some very interesting compat shim. But that's an entirely different problem from what you've described in your mail. This pretty much assumes at the hw level the only thing that works is ATS/pasid, and vram is managed with HMM exclusively. Once you have that pure driver stack you get to fake it in the kernel for compat with everything that exists already. How exactly that will look and how exactly you best construct your dma_fences for compat will depend highly upon how much is still there in this hw (e.g. wrt interrupt generation). A lot of the infrastructure was also done as part of drm_syncobj. I mean we have entirely fake kernel drivers like vgem/vkms that create dma_fence, so a hw ringbuffer is really not required.
So ... is this your problem underneath it all, or was that more a wild strawman for the discussion? -Daniel