Re: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability

9 Aug 2021


      On Fri, Aug 06, 2021 at 07:27:13AM +0000, Kasireddy, Vivek wrote:
...
Hi Daniel,
...
...
...
...
...
>> The solution:
>> - To ensure full framerate, the Guest compositor has to start it's repaint cycle
(including
...
...
>> the 9 ms wait) when the Host compositor sends the frame callback event to its
clients.
...
...
>> In order for this to happen, the dma-fence that the Guest KMS waits on -- before
sending
...
...
>> pageflip completion -- cannot be tied to a wl_buffer.release event. This means
that,
...
...
the
...
...
>> Guest compositor has to be forced to use a new buffer for its next repaint cycle
when it
...
...
>> gets a pageflip completion.
>
> Is that really the only solution?
[Kasireddy, Vivek] There are a few others I mentioned here:
https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_986572
But I think none of them are as compelling as this one.
>
> If we fix the event timestamps so that both guest and host use the same
> timestamp, but then the guest starts 5ms (or something like that) earlier,
> then things should work too? I.e.
> - host compositor starts at (previous_frametime + 9ms)
> - guest compositor starts at (previous_frametime + 4ms)
>
> Ofc this only works if the frametimes we hand out to both match _exactly_
> and are as high-precision as the ones on the host side. Which for many gpu
> drivers at least is the case, and all the ones you care about for sure :-)
>
> But if the frametimes the guest receives are the no_vblank fake ones, then
> they'll be all over the place and this carefully tuned low-latency redraw
> loop falls apart. Aside fromm the fact that without tuning the guests to
> be earlier than the hosts, you're guaranteed to miss every frame (except
> when the timing wobbliness in the guest is big enough by chance to make
> the deadline on the oddball frame).
[Kasireddy, Vivek] The Guest and Host use different event timestamps as we don't
share these between the Guest and the Host. It does not seem to be causing any
other
...
...
...
...
problems so far but we did try the experiment you mentioned (i.e., adjusting the
delays)
...
...
and it works. However, this patch series is meant to fix the issue without having to
tweak
...
...
anything (delays) because we can't do this for every compositor out there.
Maybe there could be a mechanism which allows the compositor in the guest to
automatically adjust its repaint cycle as needed.
...
This might even be possible without requiring changes in each compositor, by
adjusting
...
...
the vertical blank periods in the guest to be aligned with the host compositor repaint
cycles. Not sure about that though.
...
Even if not, both this series or making it possible to queue multiple flips require
corresponding changes in each compositor as well to have any effect.
Yeah from all the discussions and tests done it sounds even with a
deeper queue we have big coordination issues between the guest and
host compositor (like the example that the guest is now rendering at
90fps instead of 60fps like the host).
[Kasireddy, Vivek] Oh, I think you are referring to my reply to Gerd. That 90 FPS vs
60 FPS problem is a completely different issue that is associated with Qemu GTK UI
backend. With the GTK backend -- and also with SDL backend -- we Blit the Guest
scanout FB onto one of the backbuffers managed by EGL.
I am trying to add a new Qemu Wayland UI backend so that we can eliminate that Blit
and thereby have a truly zero-copy solution. And, this is there I am running into the
halved frame-rate issue -- the current problem.
Yes, that's what I referenced. But I disagree that it's a different
problem. The underlying problem in both cases is that the guest and host
compositor free-wheel instead of rendering in sync. It's just that
depending upon how exactly the flip completion event on the gues side
plays out you either get guest rendering that's faster than the host-side
60fps, or guest rendering that's much slower than the host-side 60fps.
[Kasireddy, Vivek] That used to be the case before we added a synchronization
mechanism to the GTK UI backend that uses a sync file. After adding this
and making the Guest wait until this sync file fd on the Host is signaled, we
consistently get 60 FPS because the flip completion event for the Guest is
directly tied to the signaling of the sync file in this particular case (GTK UI).
...
The fundamental problem in both cases is that they don't run in lockstep.
If you fix that, through fixing the timestamp and even reporting most
likely, you should be able to fix both bugs.
[Kasireddy, Vivek] GTK UI is an EGL based solution that Blits the Guest scanout
FB onto one of the backbuffers managed by EGL. Wayland UI is a zero-copy
solution that just wraps the dmabuf associated with Guest scanout FB in a 
wl_buffer and submits it directly to the Host compositor. These backends are
completely independent of each other and cannot be active at the same time.
In other words, we cannot have zero-copy and Blit based solutions running
parallelly. And, this issue is only relevant for Wayland UI backend and has 
nothing to do with GTK UI.
...
...
...
Hence my gut feeling reaction that first we need to get these two
compositors aligned in their timings, which propobably needs
consistent vblank periods/timestamps across them (plus/minux
guest/host clocksource fun ofc). Without this any of the next steps
will simply not work because there's too much jitter by the time the
guest compositor gets the flip completion events.
[Kasireddy, Vivek] Timings are not a problem and do not significantly
affect the repaint cycles from what I have seen so far.
...
Once we have solid events I think we should look into statically
tuning guest/host compositor deadlines (like you've suggested in a
bunch of places) to consisently make that deadline and hit 60 fps.
With that we can then look into tuning this automatically and what to
do when e.g. switching between copying and zero-copy on the host side
(which might be needed in some cases) and how to handle all that.
[Kasireddy, Vivek] As I confirm here: https://gitlab.freedesktop.org/wayland/weston/-
/issues/514#note_984065
...
tweaking the deadlines works (i.e., we get 60 FPS) as we expect. However,
I feel that this zero-copy solution I am trying to create should be independent
of compositors' deadlines, delays or other scheduling parameters.
That's not how compositors work nowadays. Your problem is that you don't
have the guest/host compositor in sync. zero-copy only changes the timing,
so it changes things from "rendering way too many frames" to "rendering
way too few frames".
We need to fix the timing/sync issue here first, not paper over it with
hacks.
[Kasireddy, Vivek] What I really meant is that the zero-copy solution should be
independent of the scheduling policies to ensure that it works with all compositors.
 IIUC, Weston for example uses the vblank/pageflip completion timestamp, the
configurable repaint-window value, refresh-rate, etc to determine when to start
its next repaint -- if there is any damage:
timespec_add_nsec(&output->next_repaint, stamp, refresh_nsec);
timespec_add_msec(&output->next_repaint, &output->next_repaint, -compositor->repaint_msec);
And, in the case of VKMS, since there is no real hardware, the timestamp is always:
now = ktime_get();
send_vblank_event(dev, e, seq, now);
vkms has been fixed since a while to fake high-precision timestamps like
from a real display.
...
When you say that the Guest/Host compositor need to stay in sync, are you 
suggesting that we need to ensure that the vblank timestamp on the Host 
needs to be shared and be the same on the Guest and a vblank/pageflip
completion for the Guest needs to be sent at exactly the same time it is sent
on the Host? If yes, I'd say that we do send the pageflip completion to Guest
around the same time a vblank is generated on the Host but it does not help
because the Guest compositor would only have 9 ms to submit a new frame
and if the Host is running Mutter, the Guest would only have 2 ms.
(https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_984341)
Not at the same time, but the same timestamp. And yes there is some fun
there, which is I think the fundamental issue. Or at least some of the
compositor experts seem to think so, and it makes sense to me.
...
...
Only, and I really mean only, when that shows that it's simply impossible
to hit 60fps with zero-copy and the guest/host fully aligned should we
look into making the overall pipeline deeper.
[Kasireddy, Vivek] From all the experiments conducted so far and given the
discussion associated with https://gitlab.freedesktop.org/wayland/weston/-/issues/514
I think we have already established that in order for a zero-copy solution to work 
reliably, the Guest compositor needs to start its repaint cycle when the Host
compositor sends a frame callback event to its clients.
...
...
...
Only when that all shows that we just can't hit 60fps consistently and
really need 3 buffers in flight should we look at deeper kms queues.
And then we really need to implement them properly and not with a
mismatch between drm_event an out-fence signalling. These quick hacks
are good for experiments, but there's a pile of other things we need
to do first. At least that's how I understand the problem here right
now.
[Kasireddy, Vivek] Experiments done so far indicate that we can hit 59 FPS consistently
-- in a zero-copy way independent of compositors' delays/deadlines -- with this
patch series + the Weston MR I linked in the cover letter. The main reason why this
works is because we relax the assumption that when the Guest compositor gets a
pageflip completion event that it could reuse the old FB it submitted in the previous
atomic flip and instead force it to use a new one. And, we send the pageflip completion
event to the Guest when the Host compositor sends a frame callback event. Lastly,
we use the (deferred) out_fence as just a mechanism to tell the Guest compositor when
it can release references on old FBs so that they can be reused again.
With that being said, the only question is how can we accomplish the above in an
upstream
...
acceptable way without regressing anything particularly on bare-metal. Its not clear if
just
...
increasing the queue depth would work or not but I think the Guest compositor has to be
told
...
when it can start its repaint cycle and when it can assume the old FB is no longer in use.
On bare-metal -- and also with VKMS as of today -- a pageflip completion indicates
both.
...
In other words, Vblank event is the same as Flip done, which makes sense on bare-metal.
But if we were to have two events at-least for VKMS: vblank to indicate to Guest to start
repaint and flip_done to indicate to drop references on old FBs, I think this problem can
be solved even without increasing the queue depth. Can this be acceptable?
That's just another flavour of your "increase queue depth without
increasing the atomic queue depth" approach. I still think the underlying
fundamental issue is a timing confusion, and the fact that adjusting the
timings fixes things too kinda proves that. So we need to fix that in a
clean way, not by shuffling things around semi-randomly until the specific
config we tests works.
[Kasireddy, Vivek] This issue is not due to a timing or timestamp mismatch. We
have carefully instrumented both the Host and Guest compositors and measured
the latencies at each step. The relevant debug data only points to the scheduling
policy -- of both Host and Guest compositors -- playing a role in Guest rendering 
at 30 FPS.
Hm but that essentially means that the events your passing around have an
even more ad-hoc implementation specific meaning: Essentially it's the
kick-off for the guest's repaint loop? That sounds even worse for a kms
uapi extension.
...
...
Iow I think we need a solution here which both slows down the 90fps to
60fps for the blit case, and the 30fps speed up to 60fps for the zerocopy
case. Because the host might need to switch transparently between blt and
zerocopy for various reasons.
[Kasireddy, Vivek] As I mentioned above, the Host (Qemu) cannot switch UI
backends at runtime. In other words, with GTK UI backend, it is always Blit
whereas Wayland UI backend is always zero-copy.
Hm ok, that at least makes things somewhat simpler. Another thing that I
just realized: What happens when the host changes screen resolution and
especially refresh rate?
-Daniel
...
Thanks,
Vivek
...
-Daniel
...
Thanks,
Vivek
...
Cheers, Daniel
...
--
Earthling Michel Dänzer               |               https://redhat.com
Libre software enthusiast             |             Mesa and X developer
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

Re: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability