Hi Nicolas,
I'm detaching this thread from our V4L2 stateless decoding spec discussion since it has drifted off-topic and would certainly be interesting to DRM folks as well!
For context: I was initially talking about writing up support for the Allwinner 2D engine as a DRM render driver, where I'd like to be able to batch jobs that affect the same destination buffer and only signal the out fence once when the batch is done. We have a similar issue in v4l2 where we'd like the destination buffer for a set of requests (each covering one H.264 slice) to be marked as done once the whole set has been decoded.
On Wednesday, 17 April 2019 at 12:22 -0400, Nicolas Dufresne wrote:
Interestingly, I'm experiencing the exact same problem dealing with a 2D graphics blitter that has limited output scaling abilities, which implies handling a large scaling operation as multiple clipped, smaller scaling operations. The issue is basically that multiple jobs have to be submitted to complete a single frame, and relying on an indication from the destination buffer (such as a fence) doesn't work to indicate that all the operations were completed, since we get the indication at each step instead of at the end of the batch.
That looks similar to the i.MX6 IPU m2m driver. It splits the image into 1024x1024 tiles and processes each tile separately. This driver has been around for a long time, so I guess they have a solution to that. They don't need requests, because there is nothing to be bundled with the input image. I know that Renesas folks have started working on a de-interlacer. Again, this kind of driver may process and reuse input buffers for motion compensation, but I don't think they need a special userspace API for that.
Thanks for the reference! I hope it's not a blitter that was contributed as a V4L2 driver instead of DRM, as it probably would be more useful in DRM (but that's way beside the point).
DRM does not offer a generic and discoverable interface for these accelerators. Note that these drivers have most of the time started as DRM drivers and their DRM side was dropped. That was the case for the Exynos drivers at least.
Heh, sadly I'm aware of how things turn out most of the time. The thing is that DRM expects drivers to implement their own interface. That's fine for passing BOs with GPU bitstream and textures, but not so much for dealing with framebuffer-based operations where the streaming and buffer interface that v4l2 has is a good fit.
There's also the fact that the 2D pipeline is fixed-function and highly hardware-specific, so we need driver-specific job descriptions to really make the most of it. That's where v4l2 is not much of a good fit for complex 2D pipelines either. Most 2D engines can take multiple inputs and blit them together in various ways, which is too far from what v4l2 deals with. So we can have fixed single-buffer pipelines with at best CSC and scaling, but not much more with v4l2 really.
I don't think it would be too much work to bring an interface to DRM to describe render framebuffers (we only have display framebuffers so far), with a simple queuing interface for scheduling driver-specific jobs, which could be grouped together so that the out fences are only signaled when every buffer of the batch is done being rendered. This last point would allow handling cases where userspace needs to perform multiple operations to carry out the single operation that it needs to do. In the case of my 2D blitter, that would be scaling above a 1024x1024 destination, which could be required to scale a video buffer up to a 1920x1080 display. With that, we can e.g. page flip the 2D engine destination buffer and be certain that scaling will be fully done when the fence is signaled.
There's also the userspace problem: DRM render has Mesa to back it in userspace and provide a generic API for other programs. For 2D engines, we don't have much to hold on to. Cairo has a DRM render interface that supports a few DRM render drivers where there is either a 2D pipeline or where pre-built shaders are used to implement a 2D pipeline, and that's about it as far as I know.
There's also the possibility of writing up a drm-render DDX to handle these 2D blitters, which could make things a lot faster when running a desktop environment. As for Wayland, well, I don't really know what to think. I was under the impression that it relies on GL for 2D operations, but I'm really not sure how true that actually is.
The thing is that DRM is great if you do immediate display stuff, while V4L2 is nice if you do streaming, where you expect to fill queues and pop buffers from queues.
In the end, this is just an interface; nothing prevents you from making an internal driver (like the Meson Canvas) and simply letting multiple subsystems expose it. Especially since some of these IPs will often support both signal and memory processing, so they fit equally into a media controller ISP, a v4l2 m2m driver or a DRM driver.
Having base drivers that can hook to both v4l2 m2m and DRM would definitely be awesome. Maybe we could have some common internal synchronization logic to make writing these drivers easier.
It would be cool if both could be used concurrently and not just return -EBUSY when the device is used with the other subsystem.
Anyway, that's my 2 cents about the situation and what we can do to improve it. I'm definitely interested in tackling these items, but it may take some time before we get there. Not to mention we need to rework media/v4l2 for per-slice decoding support ;)
Another driver you might want to look at is the Rockchip RGA driver (which is a multi-function IP, including blitting).
Yep, I'm aware of it as well. There's also Vivante, which exposes 2D cores, but I'm really not sure whether any function is actually implemented.
OMAP4 and OMAP5 have a 2D engine that seems to be Vivante as well from what I could find out, but it seems to only have blobs for bltsville and no significant docs.
Cheers,
Paul
On Wed, Apr 17, 2019 at 08:10:15PM +0200, Paul Kocialkowski wrote:
Just fyi in case you folks aren't aware, I typed up a blog post a while ago about why drm doesn't have a 2d submit api:
https://blog.ffwll.ch/2018/08/no-2d-in-drm.html
Having base drivers that can hook to both v4l2 m2m and DRM would definitely be awesome. Maybe we could have some common internal synchronization logic to make writing these drivers easier.
We have, it's called dma_fence. Ties into dma_bufs using reservation_objects.
It would be cool if both could be used concurrently and not just return -EBUSY when the device is used with the other subsystem.
We live in this world already :-) I think there are even patches (or merged already) to add fences to v4l, for Android.
OMAP4 and OMAP5 have a 2D engine that seems to be vivante as well from what I could find out, but it seems to only have blobs for bltsville and no significant docs.
Yeah, that's the usual approach for drm 2d drivers: you have a bespoke driver in userspace. Usually that means an X driver, but there's been talk of pimping the hwc interface to make that _the_ 2d accel interface. There's also fbdev ... *shudder*.
All of these options are geared towards ultimately displaying stuff on screens, not pure m2m 2d accel. -Daniel
Hi Daniel,
On Thu, 2019-04-18 at 10:18 +0200, Daniel Vetter wrote:
I definitely share the observation that each 2D engine has its own kind of pipeline, which is close to impossible to describe in a generic way while exposing all the possible features of the pipeline.
I thought about this some more yesterday and I see a few areas that could however be made generic:
* GEM allocation for framebuffers (with a unified ioctl);
* framebuffer management (that's only in KMS for now and we need pretty much the same thing here);
* some queuing mechanism, either for standalone submissions or groups of them.
So I started thinking about writing up a "DRM GFX" API which would provide this, instead of implementing it in my 2D blitter driver. There's a chance I'll submit a proposal of that along with my driver.
I am convinced the job submit ioctl needs to remain driver-specific to properly describe the pipeline though.
We have, it's called dma_fence. Ties into dma_bufs using reservation_objects.
That's not what I meant: I'm talking about exposing the 2D engine capabilities through both DRM and V4L2 M2M, where the V4L2 M2M driver would be an internal client to DRM. So it's about using the same hardware with both APIs concurrently.
And while we're at it, we could allow detaching display pipeline elements that have intermediate writeback and exposing them as 2D engines through the same API (which would return busy when the block is used by the video pipeline).
I think it would be good to have a specific library to translate between "standard" 2D ops (Porter-Duff blending and such) and driver-specific setup/submit ioctls. It could be called "libdrm-gfx" and used by an associated DDX (as well as any other program that needs 2D ops acceleration).
Cheers,
Paul
On Thu, Apr 18, 2019 at 5:55 PM Paul Kocialkowski paul.kocialkowski@bootlin.com wrote:
Out of curiosity, what area did you find a 2D blitter useful for?
Best regards, Tomasz
Hi,
On Thu, 2019-04-18 at 18:09 +0900, Tomasz Figa wrote:
Out of curiosity, what area did you find a 2D blitter useful for?
The initial motivation is to bring up a DDX with that for platforms that have 2D engines but no free software GPU drivers yet.
I also have a personal project in the works where I'd like to implement accelerated UI rendering in 2D. The idea is to avoid using GL entirely.
That last point is in part because I have a GPU-less device that I want to get going with mainline: http://linux-sunxi.org/F60_Action_Camera
Cheers,
Paul
On Thu, Apr 18, 2019 at 6:14 PM Paul Kocialkowski paul.kocialkowski@bootlin.com wrote:
Okay, thanks.
I feel like the typical DRM model with a render node and a userspace library would make sense for these specific use cases on these specific hardware platforms then.
Hopefully the availability of open drivers for 3D engines continues to improve.
Best regards, Tomasz
On Thursday, 18 April 2019 at 10:18 +0200, Daniel Vetter wrote:
We live in this world already :-) I think there's even patches (or merged already) to add fences to v4l, for Android.
This work is currently suspended. It will require some features on the DRM display side to really make this useful, but there are also a lot of challenges in V4L2. In the GFX space, most of the use cases are about rendering as soon as possible. In multimedia, though, we have two problems: we need to synchronize the frame rendering with the audio, and output buffers may come out of order due to how video CODECs are made.
For the first, we'd need a mechanism where we can schedule a render at a specific time or vblank. We can of course already implement this in software, but with fences, the scheduling would need to be done in the driver. Then if the fence is signalled earlier, the driver should hold on until the delay is met. If the fence gets signalled late, we also need to think of a workflow. As we can't schedule more than one render in DRM at a time, I don't really see yet how to make that work.
For the second, it's complicated on the V4L2 side. Currently we signal buffers when they are ready, in display order. With fences, we would receive buffer/fence pairs early, in decoding order. There are cases where the reordering is done by the driver (stateful CODECs). We cannot schedule these immediately; we would need a new mechanism to know which one comes next. If we just reuse the current mechanism, it would void the fence usage, since the fence will always be signalled by the time it reaches DRM or another v4l2 component.
There are also other issues: for a video capture pipeline, if you are not rendering ASAP, you need the HW timestamp in order to schedule. Again, we'd get the fence early, but the actual timestamp will be signalled at the very last minute, so we also risk turning the fence into pure overhead. Note that as we speak, I have colleagues who are experimenting with frame timestamp prediction that slaves to the effective timestamp (catching up over time). But we still have issues when the capture driver skips a frame (misses a capture window).
I hope these reflections are useful, Nicolas
On Fri, Apr 19, 2019 at 9:30 AM Nicolas Dufresne nicolas@ndufresne.ca wrote:
Note that a fence has a timestamp internally, and it can be queried from user space if exposed as a sync file: https://elixir.bootlin.com/linux/v5.1-rc5/source/drivers/dma-buf/sync_file.c...
Fences in V4L2 would also be useful for stateless decoders and any mem-to-mem processors that operate in order, like the blitters mentioned here or actually camera ISPs, which can often be chained into relatively sophisticated pipelines.
Best regards, Tomasz
On Friday, 19 April 2019 at 13:27 +0900, Tomasz Figa wrote:
Note that a fence has a timestamp internally and it can be queried for it from the user space if exposed as a sync file: https://elixir.bootlin.com/linux/v5.1-rc5/source/drivers/dma-buf/sync_file.c...
Don't we need something the other way around ? This seems to be the timestamp of when it was triggered (I'm not familiar with this though).
Fences in V4L2 would be also useful for stateless decoders and any mem-to-mem processors that operate in order, like the blitters mentioned here or actually camera ISPs, which can be often chained into relatively sophisticated pipelines.
I agree fence can be used to optimize specific corner cases. They are not as critical in V4L2 since we have async queues. I think the use case for fences in V4L2 is mostly to lower the latency. Not all use cases requires such a low latency. There was argument around fences that is simplify the the code, I haven't seen a compelling argument demonstrating that this would be the case for V4L2 programming. The only case is when doing V4L2 to DRM exchanges, and only in the context where time synchronization does not matter. In fact, so far it is more work since information starts flowing through separate events (buffer/fence first, later timestamps and possibly critical metadata. This might be induced by the design, but clearly there is a slight API clash.
Best regards, Tomasz
On Sat, Apr 20, 2019 at 12:31 AM Nicolas Dufresne nicolas@ndufresne.ca wrote:
Le vendredi 19 avril 2019 à 13:27 +0900, Tomasz Figa a écrit :
On Fri, Apr 19, 2019 at 9:30 AM Nicolas Dufresne nicolas@ndufresne.ca wrote:
Le jeudi 18 avril 2019 à 10:18 +0200, Daniel Vetter a écrit :
It would be cool if both could be used concurrently and not just return -EBUSY when the device is used with the other subsystem.
We live in this world already :-) I think there's even patches (or merged already) to add fences to v4l, for Android.
This work is currently suspended. It will require some features on the DRM display side to really make this useful, but there are also a lot of challenges in V4L2. In the GFX space, most use cases are about rendering as soon as possible. In multimedia, though, we have two problems: we need to synchronize the frame rendering with the audio, and output buffers may come out of order due to how video CODECs work.
For the first, we'd need a mechanism where we can schedule a render at a specific time or vblank. We can of course already implement this in software, but with fences, the scheduling would need to be done in the driver. Then if the fence is signalled earlier, the driver should hold on until the delay is met. If the fence gets signalled late, we also need to think of a workflow. As we can't schedule more than one render in DRM at a time, I don't really see yet how to make that work.
For the second, it's complicated on the V4L2 side. Currently we signal buffers when they are ready, in display order. With fences, we would receive buffer/fence pairs early (in decoding order). There are cases where reordering is done by the driver (stateful CODECs), so we cannot schedule these immediately; we would need a new mechanism to know which one comes next. If we just reuse the current mechanism, it would void the fence usage, since the fence will always be signalled by the time it reaches DRM or another V4L2 component.
There are also other issues: for a video capture pipeline, if you are not rendering ASAP, you need the HW timestamp in order to schedule. Again, we'd get the fence early, but the actual timestamp will only be known at the very last minute, so we also risk turning the fence into pure overhead. Note that as we speak, I have colleagues experimenting with frame timestamp prediction that slaves to the effective timestamp (catching up over time). But we still have issues when the capture driver has skipped a frame (missed a capture window).
Note that a fence has a timestamp internally and it can be queried for it from the user space if exposed as a sync file: https://elixir.bootlin.com/linux/v5.1-rc5/source/drivers/dma-buf/sync_file.c...
Don't we need something the other way around ? This seems to be the timestamp of when it was triggered (I'm not familiar with this though).
Honestly, I'm not fully sure what this timestamp is expected to be.
For video capture pipeline the fence would signal once the whole frame is captured, so I think it could be a reasonable value to consider later in the pipeline?
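To make the sync_file timestamp point concrete, here is a minimal userspace sketch (Python with ctypes, assuming the v5.1 UAPI) of querying per-fence signal timestamps through SYNC_IOC_FILE_INFO. The struct layouts and ioctl encoding are transcribed from include/uapi/linux/sync_file.h; the helper name fence_timestamps() is just an illustration, not an existing API.

```python
import ctypes
import fcntl

# Mirrors struct sync_fence_info / struct sync_file_info from
# include/uapi/linux/sync_file.h (layouts as of Linux v5.1).
class sync_fence_info(ctypes.Structure):
    _fields_ = [
        ("obj_name", ctypes.c_char * 32),
        ("driver_name", ctypes.c_char * 32),
        ("status", ctypes.c_int32),
        ("flags", ctypes.c_uint32),
        ("timestamp_ns", ctypes.c_uint64),  # when this fence signalled
    ]

class sync_file_info(ctypes.Structure):
    _fields_ = [
        ("name", ctypes.c_char * 32),
        ("status", ctypes.c_int32),
        ("flags", ctypes.c_uint32),
        ("num_fences", ctypes.c_uint32),
        ("pad", ctypes.c_uint32),
        ("sync_fence_info", ctypes.c_uint64),  # userspace pointer to array
    ]

def _IOWR(magic, nr, size):
    # Linux ioctl number encoding: dir(2 bits) | size(14) | type(8) | nr(8)
    return (3 << 30) | (size << 16) | (ord(magic) << 8) | nr

SYNC_IOC_FILE_INFO = _IOWR('>', 4, ctypes.sizeof(sync_file_info))

def fence_timestamps(sync_file_fd):
    """Return the signal timestamp (ns) of each fence in a sync_file."""
    info = sync_file_info()
    # First call with num_fences == 0 just reports how many fences there are.
    fcntl.ioctl(sync_file_fd, SYNC_IOC_FILE_INFO, info)
    fences = (sync_fence_info * info.num_fences)()
    info.sync_fence_info = ctypes.addressof(fences)
    # Second call fills in the per-fence array.
    fcntl.ioctl(sync_file_fd, SYNC_IOC_FILE_INFO, info)
    return [f.timestamp_ns for f in fences]
```

As the UAPI works today, timestamp_ns is only meaningful once a fence has signalled, which matches Nicolas's concern: it records when the work completed, not when the frame should be presented.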
Fences in V4L2 would also be useful for stateless decoders and any mem-to-mem processors that operate in order, like the blitters mentioned here, or camera ISPs, which can often be chained into relatively sophisticated pipelines.
I agree fences can be used to optimize specific corner cases. They are not as critical in V4L2 since we have async queues.
I wouldn't call those corner cases. A stateful decoder is actually at one of the opposite extremes, because one would normally just decode and show the frame, so not much complexity is needed to handle it, and async queues actually work quite well.
I don't think async queues are very helpful for any more complicated use cases. The userspace still needs to wake up and push the buffers through the pipeline. If you have some depth across the whole pipeline, with queues always having some buffers waiting to be processed, fences indeed wouldn't change too much (+/- the CPU time/power wasted on context switches). However, with real time use cases, such as anything involving streaming from cameras, image processing stages and encoding into a stream to be passed to a latency-sensitive application, such as WebRTC, the latency imposed by the lack of fences would be significant. Especially if the image processing in between consists of several inter-dependent stages.
I think the use case for fences in V4L2 is mostly to lower the latency. Not all use cases require such a low latency.
Indeed, not all, but I think it doesn't make fences less important, given that there are use cases that require such a low latency.
There was an argument that fences simplify the code, but I haven't seen a compelling demonstration that this would be the case for V4L2 programming. The only case is when doing V4L2 to DRM exchanges, and only in contexts where time synchronization does not matter.
Another huge use case would be Android. The lack of fences is a significant show stopper for V4L2 adoption there.
Also, V4L2 to GPU (GLES, Vulkan) exchange should not be forgotten too.
In fact, so far it is more work, since information starts flowing through separate events (buffer/fence first, later timestamps and possibly critical metadata). This might be induced by the design, but clearly there is a slight API clash.
Well, nothing is perfect from the start. (In fact, probably nothing is perfect in general. ;))
Best regards, Tomasz
Hi,
On Thu, 2019-04-18 at 20:30 -0400, Nicolas Dufresne wrote:
Le jeudi 18 avril 2019 à 10:18 +0200, Daniel Vetter a écrit :
It would be cool if both could be used concurrently and not just return -EBUSY when the device is used with the other subsystem.
We live in this world already :-) I think there's even patches (or merged already) to add fences to v4l, for Android.
This work is currently suspended. It will require some features on the DRM display side to really make this useful, but there are also a lot of challenges in V4L2. In the GFX space, most use cases are about rendering as soon as possible. In multimedia, though, we have two problems: we need to synchronize the frame rendering with the audio, and output buffers may come out of order due to how video CODECs work.
Definitely, it feels like the DRM display side is currently a good fit for render use cases, but not so much for precise display cases where we want to try and display a buffer at a given vblank target instead of "as soon as possible".
I have a userspace project where I've implemented a page flip queue, which only schedules the next flip when relevant and keeps ready buffers in the queue until then. This requires explicit vblank synchronisation (which DRM offers, but pretty much all other, higher-level display APIs don't, so I'm just using a refresh-rate timer for them) and flip-done notification.
I haven't looked too much at how to flip with a target vblank with DRM directly but maybe the atomic API already has the bits in for that (but I haven't heard of such a thing as a buffer queue, so that makes me doubt it). Well, I need to handle stuff like SDL in my userspace project, so I have to have all that queuing stuff in software anyway, but it would be good if each project didn't have to implement that. Worst case, it could be in libdrm too.
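For illustration, the kind of userspace flip queue described above can be sketched as follows. This is a simplified model of my own making: the submit_flip callback stands in for an actual drmModePageFlip() call, and the vblank event (or timer) plumbing is assumed to exist.

```python
import heapq

class FlipQueue:
    """Userspace page-flip queue: ready buffers wait here until their
    target vblank; at most one flip is submitted per vblank."""

    def __init__(self):
        self._pending = []  # min-heap of (target_vblank, seq, buffer)
        self._seq = 0

    def queue(self, target_vblank, buf):
        # seq breaks ties so heapq never has to compare buffers directly
        heapq.heappush(self._pending, (target_vblank, self._seq, buf))
        self._seq += 1

    def on_vblank(self, current_vblank, submit_flip):
        # Called from the vblank event, or from a refresh-rate timer when
        # the display API offers no vblank notification.
        due = None
        # Pop every buffer due by the next vblank; keep only the newest,
        # dropping stale ones that missed their slot.
        while self._pending and self._pending[0][0] <= current_vblank + 1:
            due = heapq.heappop(self._pending)[2]
        if due is not None:
            submit_flip(due)  # e.g. a drmModePageFlip() wrapper
```

The key design choice is that a buffer which missed its vblank is silently superseded by a newer due buffer rather than being flipped late, which is exactly the policy decision that becomes hard once the flip is already queued in the kernel behind a fence.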
For the first, we'd need a mechanism where we can schedule a render at a specific time or vblank. We can of course already implement this in software, but with fences, the scheduling would need to be done in the driver. Then if the fence is signalled earlier, the driver should hold on until the delay is met. If the fence gets signalled late, we also need to think of a workflow. As we can't schedule more than one render in DRM at a time, I don't really see yet how to make that work.
Indeed, that's also one of the main issues I've spotted. Before using an implicit fence, we basically have to make sure the frame is due for display at the next vblank. Otherwise, we need to refrain from using the fence and schedule the flip later, which is kind of counter-productive.
So maybe adding this queue in DRM directly would make everyone's life much easier for non-render applications.
I feel like specifying a target vblank would be a good unit for that, since it's our native granularity after all (while a timestamp is not).
For the second, it's complicated on the V4L2 side. Currently we signal buffers when they are ready, in display order. With fences, we would receive buffer/fence pairs early (in decoding order). There are cases where reordering is done by the driver (stateful CODECs), so we cannot schedule these immediately; we would need a new mechanism to know which one comes next. If we just reuse the current mechanism, it would void the fence usage, since the fence will always be signalled by the time it reaches DRM or another V4L2 component.
Well, our v4l2 buffers do have a timestamp and fences expose it too, so we'd need DRM to convert that to a target vblank and add it to the internal queue mentioned above. That seems doable.
I think we only gave a vague meaning to the v4l2 timestamp for the decoding case and it could be any number, the timestamp when submitting decoding or the target timestamp for the frame. I think we should aim for the latter, but not sure it's always doable to know beforehand. Perhaps you have a clear idea of this?
There are also other issues: for a video capture pipeline, if you are not rendering ASAP, you need the HW timestamp in order to schedule. Again, we'd get the fence early, but the actual timestamp will only be known at the very last minute, so we also risk turning the fence into pure overhead. Note that as we speak, I have colleagues experimenting with frame timestamp prediction that slaves to the effective timestamp (catching up over time). But we still have issues when the capture driver has skipped a frame (missed a capture window).
I hope this is useful reflection data,
It is definitely very useful, and there seem to be a few things that could be improved already without too much effort.
Cheers,
Paul
On 2019-04-19 10:38 a.m., Paul Kocialkowski wrote:
On Thu, 2019-04-18 at 20:30 -0400, Nicolas Dufresne wrote:
Le jeudi 18 avril 2019 à 10:18 +0200, Daniel Vetter a écrit :
It would be cool if both could be used concurrently and not just return -EBUSY when the device is used with the other subsystem.
We live in this world already :-) I think there's even patches (or merged already) to add fences to v4l, for Android.
This work is currently suspended. It will require some features on the DRM display side to really make this useful, but there are also a lot of challenges in V4L2. In the GFX space, most use cases are about rendering as soon as possible. In multimedia, though, we have two problems: we need to synchronize the frame rendering with the audio, and output buffers may come out of order due to how video CODECs work.
Definitely, it feels like the DRM display side is currently a good fit for render use cases, but not so much for precise display cases where we want to try and display a buffer at a given vblank target instead of "as soon as possible".
I have a userspace project where I've implemented a page flip queue, which only schedules the next flip when relevant and keeps ready buffers in the queue until then. This requires explicit vblank synchronisation (which DRM offers, but pretty much all other, higher-level display APIs don't, so I'm just using a refresh-rate timer for them) and flip-done notification.
I haven't looked too much at how to flip with a target vblank with DRM directly but maybe the atomic API already has the bits in for that (but I haven't heard of such a thing as a buffer queue, so that makes me doubt it).
Not directly. What's available is that if userspace waits for vblank n and then submits a flip, the flip will complete in vblank n+1 (or a later vblank, depending on when the flip is submitted and when the fences the flip depends on signal).
There is reluctance to allow more than one flip to be queued in the kernel, as it would considerably increase complexity there. It would probably only be considered if there was a compelling use case which was outright impossible otherwise.
Well, I need to handle stuff like SDL in my userspace project, so I have to have all that queuing stuff in software anyway, but it would be good if each project didn't have to implement that. Worst case, it could be in libdrm too.
Usually, this kind of queuing will be handled in a display server such as Xorg or a Wayland compositor, not by the application such as a video player itself, or any library in the latter's address space. I'm not sure there's much potential for sharing code between display servers for this.
For the first, we'd need a mechanism where we can schedule a render at a specific time or vblank. We can of course already implement this in software, but with fences, the scheduling would need to be done in the driver. Then if the fence is signalled earlier, the driver should hold on until the delay is met. If the fence gets signalled late, we also need to think of a workflow. As we can't schedule more than one render in DRM at a time, I don't really see yet how to make that work.
Indeed, that's also one of the main issues I've spotted. Before using an implicit fence, we basically have to make sure the frame is due for display at the next vblank. Otherwise, we need to refrain from using the fence and schedule the flip later, which is kind of counter-productive.
Fences are about signalling that the contents of a frame are "done" and ready to be presented. They're not about specifying which frame is to be presented when.
I feel like specifying a target vblank would be a good unit for that,
The mechanism described above works for that.
since it's our native granularity after all (while a timestamp is not).
Note that variable refresh rate (Adaptive Sync / FreeSync / G-Sync) changes things in this regard. It makes the vblank length variable, and if you wait for multiple vblanks between flips, you get the maximum vblank length corresponding to the minimum refresh rate / timing granularity. Thus, it would be useful to allow userspace to specify a timestamp corresponding to the earliest time when the flip is to complete. The kernel could then try to hit that as closely as possible.
Le mercredi 24 avril 2019 à 10:31 +0200, Michel Dänzer a écrit :
On 2019-04-19 10:38 a.m., Paul Kocialkowski wrote:
On Thu, 2019-04-18 at 20:30 -0400, Nicolas Dufresne wrote:
Le jeudi 18 avril 2019 à 10:18 +0200, Daniel Vetter a écrit :
It would be cool if both could be used concurrently and not just return -EBUSY when the device is used with the other subsystem.
We live in this world already :-) I think there's even patches (or merged already) to add fences to v4l, for Android.
This work is currently suspended. It will require some features on the DRM display side to really make this useful, but there are also a lot of challenges in V4L2. In the GFX space, most use cases are about rendering as soon as possible. In multimedia, though, we have two problems: we need to synchronize the frame rendering with the audio, and output buffers may come out of order due to how video CODECs work.
Definitely, it feels like the DRM display side is currently a good fit for render use cases, but not so much for precise display cases where we want to try and display a buffer at a given vblank target instead of "as soon as possible".
I have a userspace project where I've implemented a page flip queue, which only schedules the next flip when relevant and keeps ready buffers in the queue until then. This requires explicit vblank synchronisation (which DRM offers, but pretty much all other, higher-level display APIs don't, so I'm just using a refresh-rate timer for them) and flip-done notification.
I haven't looked too much at how to flip with a target vblank with DRM directly but maybe the atomic API already has the bits in for that (but I haven't heard of such a thing as a buffer queue, so that makes me doubt it).
Not directly. What's available is that if userspace waits for vblank n and then submits a flip, the flip will complete in vblank n+1 (or a later vblank, depending on when the flip is submitted and when the fences the flip depends on signal).
There is reluctance to allow more than one flip to be queued in the kernel, as it would considerably increase complexity there. It would probably only be considered if there was a compelling use case which was outright impossible otherwise.
Well, I need to handle stuff like SDL in my userspace project, so I have to have all that queuing stuff in software anyway, but it would be good if each project didn't have to implement that. Worst case, it could be in libdrm too.
Usually, this kind of queuing will be handled in a display server such as Xorg or a Wayland compositor, not by the application such as a video player itself, or any library in the latter's address space. I'm not sure there's much potential for sharing code between display servers for this.
For the first, we'd need a mechanism where we can schedule a render at a specific time or vblank. We can of course already implement this in software, but with fences, the scheduling would need to be done in the driver. Then if the fence is signalled earlier, the driver should hold on until the delay is met. If the fence gets signalled late, we also need to think of a workflow. As we can't schedule more than one render in DRM at a time, I don't really see yet how to make that work.
Indeed, that's also one of the main issues I've spotted. Before using an implicit fence, we basically have to make sure the frame is due for display at the next vblank. Otherwise, we need to refrain from using the fence and schedule the flip later, which is kind of counter-productive.
Fences are about signalling that the contents of a frame are "done" and ready to be presented. They're not about specifying which frame is to be presented when.
I feel like specifying a target vblank would be a good unit for that,
The mechanism described above works for that.
since it's our native granularity after all (while a timestamp is not).
Note that variable refresh rate (Adaptive Sync / FreeSync / G-Sync) changes things in this regard. It makes the vblank length variable, and if you wait for multiple vblanks between flips, you get the maximum vblank length corresponding to the minimum refresh rate / timing granularity. Thus, it would be useful to allow userspace to specify a timestamp corresponding to the earliest time when the flip is to complete. The kernel could then try to hit that as closely as possible.
Rendering a video stream is more complex than what you describe here. Whenever there is an unexpected delay (late delivery of a frame, for example) you may end up in a situation where one frame is only ready after its targeted vblank. If another frame targeting the following vblank gets ready on time, the previous frame should be replaced by the more recent one.
With fences, even if you received the next frame on time, naively replacing the current one is not possible, because we don't know when the fence for the next frame will be signalled. If you simply always replace the current frame, you may end up skipping a lot more vblanks than you expect, and that results in jumpy playback.
Render queues with timestamps are used to smooth rendering and handle rendering collisions so that latency is kept low (like when you have a 100 fps video on a 60 Hz display). This is normally done in userspace, but with fences, you ask the kernel to render something at an unpredictable future time, so we lose the ability to make the final decision.
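The timestamp-based collision handling described here can be sketched as pure userspace logic. This is a simplified model of my own, not tied to any real kernel API; frame presentation timestamps and vblank times are assumed to be on the same clock (milliseconds below).

```python
def schedule_frames(frame_pts, vblank_times):
    """For each vblank, pick the latest frame whose presentation
    timestamp has been reached; earlier frames that collide on the
    same vblank are dropped.  Models e.g. 100 fps content shown on
    a 60 Hz display."""
    shown, dropped = [], []
    i = 0
    for t in vblank_times:
        best = None
        # Consume every frame due by this vblank; only the newest wins.
        while i < len(frame_pts) and frame_pts[i] <= t:
            if best is not None:
                dropped.append(best)  # superseded by a newer due frame
            best = frame_pts[i]
            i += 1
        if best is not None:
            shown.append((t, best))
    return shown, dropped
```

For 100 fps content (frames at 0, 10, 20, 30, 40, 50 ms) on a ~60 Hz display (vblanks at 0, 17, 33, 50 ms), two frames collide and are dropped; this is the final decision userspace can make at the last moment, and loses once the flip is queued behind a fence.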
On 2019-04-24 2:01 p.m., Nicolas Dufresne wrote:
Le mercredi 24 avril 2019 à 10:31 +0200, Michel Dänzer a écrit :
On 2019-04-19 10:38 a.m., Paul Kocialkowski wrote:
On Thu, 2019-04-18 at 20:30 -0400, Nicolas Dufresne wrote:
Le jeudi 18 avril 2019 à 10:18 +0200, Daniel Vetter a écrit :
For the first, we'd need a mechanism where we can schedule a render at a specific time or vblank. We can of course already implement this in software, but with fences, the scheduling would need to be done in the driver. Then if the fence is signalled earlier, the driver should hold on until the delay is met. If the fence gets signalled late, we also need to think of a workflow. As we can't schedule more than one render in DRM at a time, I don't really see yet how to make that work.
Indeed, that's also one of the main issues I've spotted. Before using an implicit fence, we basically have to make sure the frame is due for display at the next vblank. Otherwise, we need to refrain from using the fence and schedule the flip later, which is kind of counter-productive.
Fences are about signalling that the contents of a frame are "done" and ready to be presented. They're not about specifying which frame is to be presented when.
I feel like specifying a target vblank would be a good unit for that,
The mechanism described above works for that.
since it's our native granularity after all (while a timestamp is not).
Note that variable refresh rate (Adaptive Sync / FreeSync / G-Sync) changes things in this regard. It makes the vblank length variable, and if you wait for multiple vblanks between flips, you get the maximum vblank length corresponding to the minimum refresh rate / timing granularity. Thus, it would be useful to allow userspace to specify a timestamp corresponding to the earliest time when the flip is to complete. The kernel could then try to hit that as closely as possible.
Rendering a video stream is more complex than what you describe here. Whenever there is an unexpected delay (late delivery of a frame, for example) you may end up in a situation where one frame is only ready after its targeted vblank. If another frame targeting the following vblank gets ready on time, the previous frame should be replaced by the more recent one.
With fences, even if you received the next frame on time, naively replacing the current one is not possible, because we don't know when the fence for the next frame will be signalled. If you simply always replace the current frame, you may end up skipping a lot more vblanks than you expect, and that results in jumpy playback.
So you want to be able to replace a queued flip with another one then. That doesn't necessarily require allowing more than one flip to be queued ahead of time.
Note that this can also be done in userspace with explicit fencing (by only selecting a frame and submitting it to the kernel after all corresponding fences have signalled), at least to some degree, but the kernel should be able to do it up to a later point in time and more reliably, with less risk of missing a flip for a frame which becomes ready just in time.
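A sketch of that userspace-side approach (an illustration, not an existing API): sync_file fds poll readable once their fence has signalled, so a zero-timeout select() lets the compositor ask, right before the flip deadline, which frames are actually ready and pick the newest one. Any pollable fd works as a stand-in here.

```python
import select

def pick_newest_ready(frames):
    """frames: list of (fence_fd, buffer) pairs, oldest first.

    A sync_file fd polls readable once its fence has signalled, so a
    non-blocking select() tells us which frames are ready; we flip the
    newest ready one and let older ready frames be dropped."""
    fds = [fd for fd, _ in frames]
    readable, _, _ = select.select(fds, [], [], 0)  # zero timeout: just probe
    ready = set(readable)
    for fd, buf in reversed(frames):                # scan newest first
        if fd in ready:
            return buf
    return None                                     # nothing ready yet
```

As Michel notes, the limitation is the probe time: done in userspace, the decision must be taken early enough to submit the flip, so a frame whose fence signals just after the probe misses the vblank that the kernel could still have hit.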
Render queues with timestamps are used to smooth rendering and handle rendering collisions so that latency is kept low (like when you have a 100 fps video on a 60 Hz display). This is normally done in userspace, but with fences, you ask the kernel to render something at an unpredictable future time, so we lose the ability to make the final decision.
That's just not what fences are intended to be used for with the current KMS UAPI.
Hi,
On Wed, 2019-04-24 at 16:39 +0200, Michel Dänzer wrote:
On 2019-04-24 2:01 p.m., Nicolas Dufresne wrote:
Le mercredi 24 avril 2019 à 10:31 +0200, Michel Dänzer a écrit :
On 2019-04-19 10:38 a.m., Paul Kocialkowski wrote:
On Thu, 2019-04-18 at 20:30 -0400, Nicolas Dufresne wrote:
Le jeudi 18 avril 2019 à 10:18 +0200, Daniel Vetter a écrit :
For the first, we'd need a mechanism where we can schedule a render at a specific time or vblank. We can of course already implement this in software, but with fences, the scheduling would need to be done in the driver. Then if the fence is signalled earlier, the driver should hold on until the delay is met. If the fence gets signalled late, we also need to think of a workflow. As we can't schedule more than one render in DRM at a time, I don't really see yet how to make that work.
Indeed, that's also one of the main issues I've spotted. Before using an implicit fence, we basically have to make sure the frame is due for display at the next vblank. Otherwise, we need to refrain from using the fence and schedule the flip later, which is kind of counter-productive.
Fences are about signalling that the contents of a frame are "done" and ready to be presented. They're not about specifying which frame is to be presented when.
I feel like specifying a target vblank would be a good unit for that,
The mechanism described above works for that.
since it's our native granularity after all (while a timestamp is not).
Note that variable refresh rate (Adaptive Sync / FreeSync / G-Sync) changes things in this regard. It makes the vblank length variable, and if you wait for multiple vblanks between flips, you get the maximum vblank length corresponding to the minimum refresh rate / timing granularity. Thus, it would be useful to allow userspace to specify a timestamp corresponding to the earliest time when the flip is to complete. The kernel could then try to hit that as closely as possible.
Rendering a video stream is more complex than what you describe here. Whenever there is an unexpected delay (late delivery of a frame, for example) you may end up in a situation where one frame is only ready after its targeted vblank. If another frame targeting the following vblank gets ready on time, the previous frame should be replaced by the more recent one.
With fences, even if you received the next frame on time, naively replacing the current one is not possible, because we don't know when the fence for the next frame will be signalled. If you simply always replace the current frame, you may end up skipping a lot more vblanks than you expect, and that results in jumpy playback.
So you want to be able to replace a queued flip with another one then. That doesn't necessarily require allowing more than one flip to be queued ahead of time.
There might be other ways to do it, but this one has plenty of advantages.
Note that this can also be done in userspace with explicit fencing (by only selecting a frame and submitting it to the kernel after all corresponding fences have signalled), at least to some degree, but the kernel should be able to do it up to a later point in time and more reliably, with less risk of missing a flip for a frame which becomes ready just in time.
Indeed, but it would be great if we could do that with implicit fencing as well.
Render queues with timestamps are used to smooth rendering and handle rendering collisions so that latency is kept low (like when you have a 100 fps video on a 60 Hz display). This is normally done in userspace, but with fences, you ask the kernel to render something at an unpredictable future time, so we lose the ability to make the final decision.
That's just not what fences are intended to be used for with the current KMS UAPI.
Yes, and I think we're discussing towards changing that in the future.
Cheers,
Paul
On Wed, Apr 24, 2019 at 4:41 PM Paul Kocialkowski paul.kocialkowski@bootlin.com wrote:
Hi,
On Wed, 2019-04-24 at 16:39 +0200, Michel Dänzer wrote:
On 2019-04-24 2:01 p.m., Nicolas Dufresne wrote:
Le mercredi 24 avril 2019 à 10:31 +0200, Michel Dänzer a écrit :
On 2019-04-19 10:38 a.m., Paul Kocialkowski wrote:
On Thu, 2019-04-18 at 20:30 -0400, Nicolas Dufresne wrote:
Le jeudi 18 avril 2019 à 10:18 +0200, Daniel Vetter a écrit :
For the first, we'd need a mechanism where we can schedule a render at a specific time or vblank. We can of course already implement this in software, but with fences, the scheduling would need to be done in the driver. Then if the fence is signalled earlier, the driver should hold on until the delay is met. If the fence gets signalled late, we also need to think of a workflow. As we can't schedule more than one render in DRM at a time, I don't really see yet how to make that work.
Indeed, that's also one of the main issues I've spotted. Before using an implicit fence, we basically have to make sure the frame is due for display at the next vblank. Otherwise, we need to refrain from using the fence and schedule the flip later, which is kind of counter-productive.
Fences are about signalling that the contents of a frame are "done" and ready to be presented. They're not about specifying which frame is to be presented when.
I feel like specifying a target vblank would be a good unit for that,
The mechanism described above works for that.
since it's our native granularity after all (while a timestamp is not).
Note that variable refresh rate (Adaptive Sync / FreeSync / G-Sync) changes things in this regard. It makes the vblank length variable, and if you wait for multiple vblanks between flips, you get the maximum vblank length corresponding to the minimum refresh rate / timing granularity. Thus, it would be useful to allow userspace to specify a timestamp corresponding to the earliest time when the flip is to complete. The kernel could then try to hit that as closely as possible.
Rendering a video stream is more complex than what you describe here. Whenever there is an unexpected delay (late delivery of a frame, for example) you may end up in a situation where one frame is only ready after its targeted vblank. If another frame targeting the following vblank gets ready on time, the previous frame should be replaced by the more recent one.
With fences, even if you received the next frame on time, naively replacing the current one is not possible, because we don't know when the fence for the next frame will be signalled. If you simply always replace the current frame, you may end up skipping a lot more vblanks than you expect, and that results in jumpy playback.
So you want to be able to replace a queued flip with another one then. That doesn't necessarily require allowing more than one flip to be queued ahead of time.
There might be other ways to do it, but this one has plenty of advantages.
The point of kms (well one of the reasons) was to separate the implementation of modesetting for specific hw from policy decisions like which frames to drop and how to schedule them. Kernel gives tools, userspace implements the actual protocols.
There's definitely a bit of a gap around scheduling flips for a specific frame, or allowing an already scheduled flip to be cancelled or overwritten, but no one has yet come up with a clear proposal for new uapi + example implementation + userspace implementation + big enough support from other compositors that this is what they want too.
And yes writing a really good compositor is really hard, and I think a lot of people underestimate that and just create something useful for their niche. If userspace can't come up with a shared library of helpers, I don't think baking it in as kernel uapi with 10+ years regression free api guarantees is going to make it any better.
Note that this can also be done in userspace with explicit fencing (by only selecting a frame and submitting it to the kernel after all corresponding fences have signalled), at least to some degree, but the kernel should be able to do it up to a later point in time and more reliably, with less risk of missing a flip for a frame which becomes ready just in time.
Indeed, but it would be great if we could do that with implicit fencing as well.
1. extract implicit fences from dma-buf. This part is just an idea, but easy to implement once we have someone who actually wants this. All we need is a new ioctl on the dma-buf to export the fences from the reservation_object as a sync_file (either the exclusive or the shared ones, selected with a flag). 2. do the exact same frame scheduling as with explicit fencing 3. supply explicit fences in your atomic ioctl calls - these should overrule any implicit fences (assuming correct kernel drivers, but we have helpers so you can assume they all work correctly).
By design this is possible, it's just that no one yet bothered enough to make it happen. -Daniel
Render queues with timestamp are used to smooth rendering and handle rendering collision so that the latency is kept low (like when you have a 100fps video over a 60Hz display). This is normally done in userspace, but with fences, you ask the kernel to render something in an unpredictable future, so we loose the ability to make the final decision.
That's just not what fences are intended to be used for with the current KMS UAPI.
Yes, and I think we're discussing towards changing that in the future.
Cheers,
Paul
-- Paul Kocialkowski, Bootlin Embedded Linux and kernel engineering https://bootlin.com
Le mercredi 24 avril 2019 à 17:06 +0200, Daniel Vetter a écrit :
On Wed, Apr 24, 2019 at 4:41 PM Paul Kocialkowski paul.kocialkowski@bootlin.com wrote:
Hi,
On Wed, 2019-04-24 at 16:39 +0200, Michel Dänzer wrote:
On 2019-04-24 2:01 p.m., Nicolas Dufresne wrote:
Le mercredi 24 avril 2019 à 10:31 +0200, Michel Dänzer a écrit :
On 2019-04-19 10:38 a.m., Paul Kocialkowski wrote:
On Thu, 2019-04-18 at 20:30 -0400, Nicolas Dufresne wrote:
Le jeudi 18 avril 2019 à 10:18 +0200, Daniel Vetter a écrit :
In the first, we'd need a mechanism where we can schedule a render at a specific time or vblank. We can of course already implement this in software, but with fences, the scheduling would need to be done in the driver. Then if the fence is signalled earlier, the driver should hold on until the delay is met. If the fence got signalled late, we also need to think of a workflow. As we can't schedule more than one render in DRM at one time, I don't really see yet how to make that work.
Indeed, that's also one of the main issues I've spotted. Before using an implicit fence, we basically have to make sure the frame is due for display at the next vblank. Otherwise, we need to refrain from using the fence and schedule the flip later, which is kind of counter-productive.
Fences are about signalling that the contents of a frame are "done" and ready to be presented. They're not about specifying which frame is to be presented when.
I feel like specifying a target vblank would be a good unit for that,
The mechanism described above works for that.
since it's our native granularity after all (while a timestamp is not).
Note that variable refresh rate (Adaptive Sync / FreeSync / G-Sync) changes things in this regard. It makes the vblank length variable, and if you wait for multiple vblanks between flips, you get the maximum vblank length corresponding to the minimum refresh rate / timing granularity. Thus, it would be useful to allow userspace to specify a timestamp corresponding to the earliest time when the flip is to complete. The kernel could then try to hit that as closely as possible.
Rendering a video stream is more complex than what you describe here. Whenever there is an unexpected delay (late delivery of a frame, for example), you may end up in a situation where one frame is ready after the targeted vblank. If there is another frame that targets the following vblank and gets ready on time, the previous frame should be replaced by the most recent one.
With fences, what happens is that even if you received the next frame on time, naively replacing it is not possible, because we don't know when the fence for the next frame will be signalled. If you simply always replace the current frame, you may end up skipping many more vblanks than you expect, and that results in jumpy playback.
So you want to be able to replace a queued flip with another one then. That doesn't necessarily require allowing more than one flip to be queued ahead of time.
There might be other ways to do it, but this one has plenty of advantages.
The point of kms (well one of the reasons) was to separate the implementation of modesetting for specific hw from policy decisions like which frames to drop and how to schedule them. Kernel gives tools, userspace implements the actual protocols.
There's definitely a bit of a gap around scheduling flips for a specific frame, or allowing an already scheduled flip to be cancelled/overwritten, but no one has yet come up with a clear proposal for new uapi + example implementation + userspace implementation + big enough support from other compositors that this is what they want too.
And yes, writing a really good compositor is really hard, and I think a lot of people underestimate that and just create something useful for their niche. If userspace can't come up with a shared library of helpers, I don't think baking it into kernel uapi with 10+ years of regression-free API guarantees is going to make it any better.
Note that this can also be done in userspace with explicit fencing (by only selecting a frame and submitting it to the kernel after all corresponding fences have signalled), at least to some degree, but the kernel should be able to do it up to a later point in time and more reliably, with less risk of missing a flip for a frame which becomes ready just in time.
Indeed, but it would be great if we could do that with implicit fencing as well.
1. extract implicit fences from dma-buf. This part is just an idea, but easy to implement once we have someone who actually wants this. All we need is a new ioctl on the dma-buf to export the fences from the reservation_object as a sync_file (either the exclusive or the shared ones, selected with a flag).
2. do the exact same frame scheduling as with explicit fencing.
3. supply explicit fences in your atomic ioctl calls - these should overrule any implicit fences (assuming correct kernel drivers, but we have helpers so you can assume they all work correctly).
By design this is possible, it's just that no one yet bothered enough to make it happen. -Daniel
I'm not sure I understand the workflow of this one. I'm all in favour of leaving the hard work to userspace. Note that I have assumed explicit fences from the start; I don't think implicit fences will ever exist in v4l2, but I might be wrong. What I understood is that there was a previous attempt in the past, but it raised more issues than it actually solved. So, that being said, how do you handle exactly the following use cases:
- A frame was lost by the capture driver, but it was scheduled as the next buffer to render (normally the previous frame should remain).
- The scheduled frame is late for the next vblank (didn't signal on time); a new one may be better for the next vblank, but we will only know when its fence is signalled.
Better in this context means that the presentation time of this frame is closer to the next vblank time. Keep in mind that the idea is to schedule the frames before they are signalled, in order to make the usage of the fence useful in lowering the latency. Of course, as Michel said, we could just always wait on the fence and then schedule. But if you do that, why would you bother implementing fences in v4l2 to start with? DQBuf does just that already.
Note that this has nothing to do with the valid use case where you would want to apply various transformations (m2m or gpu) on the capture buffer. You still gain from the fence in that context, even if you wait on the fence in userspace before display. This alone is likely enough to justify using fences.
Render queues with timestamps are used to smooth rendering and handle rendering collisions so that latency is kept low (like when you have a 100fps video on a 60Hz display). This is normally done in userspace, but with fences you ask the kernel to render something at an unpredictable future time, so we lose the ability to make the final decision.
That's just not what fences are intended to be used for with the current KMS UAPI.
Yes, and I think we're discussing towards changing that in the future.
Cheers,
Paul
-- Paul Kocialkowski, Bootlin Embedded Linux and kernel engineering https://bootlin.com
On 2019-04-24 5:44 p.m., Nicolas Dufresne wrote:
Le mercredi 24 avril 2019 à 17:06 +0200, Daniel Vetter a écrit :
On Wed, Apr 24, 2019 at 4:41 PM Paul Kocialkowski paul.kocialkowski@bootlin.com wrote:
On Wed, 2019-04-24 at 16:39 +0200, Michel Dänzer wrote:
On 2019-04-24 2:01 p.m., Nicolas Dufresne wrote:
Rendering a video stream is more complex than what you describe here. Whenever there is an unexpected delay (late delivery of a frame, for example), you may end up in a situation where one frame is ready after the targeted vblank. If there is another frame that targets the following vblank and gets ready on time, the previous frame should be replaced by the most recent one.
With fences, what happens is that even if you received the next frame on time, naively replacing it is not possible, because we don't know when the fence for the next frame will be signalled. If you simply always replace the current frame, you may end up skipping many more vblanks than you expect, and that results in jumpy playback.
So you want to be able to replace a queued flip with another one then. That doesn't necessarily require allowing more than one flip to be queued ahead of time.
There might be other ways to do it, but this one has plenty of advantages.
The point of kms (well one of the reasons) was to separate the implementation of modesetting for specific hw from policy decisions like which frames to drop and how to schedule them. Kernel gives tools, userspace implements the actual protocols.
There's definitely a bit of a gap around scheduling flips for a specific frame, or allowing an already scheduled flip to be cancelled/overwritten, but no one has yet come up with a clear proposal for new uapi + example implementation + userspace implementation + big enough support from other compositors that this is what they want too.
Actually, the ATOMIC_AMEND patches propose a way to replace a scheduled flip?
Note that this can also be done in userspace with explicit fencing (by only selecting a frame and submitting it to the kernel after all corresponding fences have signalled), at least to some degree, but the kernel should be able to do it up to a later point in time and more reliably, with less risk of missing a flip for a frame which becomes ready just in time.
Indeed, but it would be great if we could do that with implicit fencing as well.
1. extract implicit fences from dma-buf. This part is just an idea, but easy to implement once we have someone who actually wants this. All we need is a new ioctl on the dma-buf to export the fences from the reservation_object as a sync_file (either the exclusive or the shared ones, selected with a flag).
2. do the exact same frame scheduling as with explicit fencing.
3. supply explicit fences in your atomic ioctl calls - these should overrule any implicit fences (assuming correct kernel drivers, but we have helpers so you can assume they all work correctly).
By design this is possible, it's just that no one yet bothered enough to make it happen. -Daniel
I'm not sure I understand the workflow of this one. I'm all in favour of leaving the hard work to userspace. Note that I have assumed explicit fences from the start; I don't think implicit fences will ever exist in v4l2, but I might be wrong. What I understood is that there was a previous attempt in the past, but it raised more issues than it actually solved. So, that being said, how do you handle exactly the following use cases:
- A frame was lost by the capture driver, but it was scheduled as the next buffer to render (normally the previous frame should remain).
Userspace just doesn't call into the kernel to flip to the lost frame, so the previous one remains.
- The scheduled frame is late for the next vblank (didn't signal on time); a new one may be better for the next vblank, but we will only know when its fence is signalled.
Userspace only selects a frame and submits it to the kernel after all its fences have signalled.
Better in this context means that the presentation time of this frame is closer to the next vblank time. Keep in mind that the idea is to schedule the frames before they are signalled, in order to make the usage of the fence useful in lowering the latency.
Fences are about signalling completion, not about low latency.
With a display server, the client can send frames to the display server ahead of time, only the display server needs to wait for fences to signal before submitting frames to the kernel.
Of course, as Michel said, we could just always wait on the fence and then schedule. But if you do that, why would you bother implementing fences in v4l2 to start with? DQBuf does just that already.
A fence is more likely to work out of the box with non-V4L-related code than DQBuf?
Le mercredi 24 avril 2019 à 18:54 +0200, Michel Dänzer a écrit :
On 2019-04-24 5:44 p.m., Nicolas Dufresne wrote:
Le mercredi 24 avril 2019 à 17:06 +0200, Daniel Vetter a écrit :
On Wed, Apr 24, 2019 at 4:41 PM Paul Kocialkowski paul.kocialkowski@bootlin.com wrote:
On Wed, 2019-04-24 at 16:39 +0200, Michel Dänzer wrote:
On 2019-04-24 2:01 p.m., Nicolas Dufresne wrote:
Rendering a video stream is more complex than what you describe here. Whenever there is an unexpected delay (late delivery of a frame, for example), you may end up in a situation where one frame is ready after the targeted vblank. If there is another frame that targets the following vblank and gets ready on time, the previous frame should be replaced by the most recent one.
With fences, what happens is that even if you received the next frame on time, naively replacing it is not possible, because we don't know when the fence for the next frame will be signalled. If you simply always replace the current frame, you may end up skipping many more vblanks than you expect, and that results in jumpy playback.
So you want to be able to replace a queued flip with another one then. That doesn't necessarily require allowing more than one flip to be queued ahead of time.
There might be other ways to do it, but this one has plenty of advantages.
The point of kms (well one of the reasons) was to separate the implementation of modesetting for specific hw from policy decisions like which frames to drop and how to schedule them. Kernel gives tools, userspace implements the actual protocols.
There's definitely a bit of a gap around scheduling flips for a specific frame, or allowing an already scheduled flip to be cancelled/overwritten, but no one has yet come up with a clear proposal for new uapi + example implementation + userspace implementation + big enough support from other compositors that this is what they want too.
Actually, the ATOMIC_AMEND patches propose a way to replace a scheduled flip?
Note that this can also be done in userspace with explicit fencing (by only selecting a frame and submitting it to the kernel after all corresponding fences have signalled), at least to some degree, but the kernel should be able to do it up to a later point in time and more reliably, with less risk of missing a flip for a frame which becomes ready just in time.
Indeed, but it would be great if we could do that with implicit fencing as well.
1. extract implicit fences from dma-buf. This part is just an idea, but easy to implement once we have someone who actually wants this. All we need is a new ioctl on the dma-buf to export the fences from the reservation_object as a sync_file (either the exclusive or the shared ones, selected with a flag).
2. do the exact same frame scheduling as with explicit fencing.
3. supply explicit fences in your atomic ioctl calls - these should overrule any implicit fences (assuming correct kernel drivers, but we have helpers so you can assume they all work correctly).
By design this is possible, it's just that no one yet bothered enough to make it happen. -Daniel
I'm not sure I understand the workflow of this one. I'm all in favour of leaving the hard work to userspace. Note that I have assumed explicit fences from the start; I don't think implicit fences will ever exist in v4l2, but I might be wrong. What I understood is that there was a previous attempt in the past, but it raised more issues than it actually solved. So, that being said, how do you handle exactly the following use cases:
- A frame was lost by the capture driver, but it was scheduled as the next buffer to render (normally the previous frame should remain).
Userspace just doesn't call into the kernel to flip to the lost frame, so the previous one remains.
We are stuck in a loop, you and me. Considering v4l2 to drm, where fences don't exist in v4l2, it makes very little sense to bring up fences if we are to wait on the fence in userspace. Unless, of course, you have other operations beforehand making proper use of the fences.
- The scheduled frame is late for the next vblank (didn't signal on time); a new one may be better for the next vblank, but we will only know when its fence is signalled.
Userspace only selects a frame and submits it to the kernel after all its fences have signalled.
Better in this context means that the presentation time of this frame is closer to the next vblank time. Keep in mind that the idea is to schedule the frames before they are signalled, in order to make the usage of the fence useful in lowering the latency.
Fences are about signalling completion, not about low latency.
It can be used to remove a roundtrip with userspace at a very time-sensitive moment. If you pass a dmabuf with its unsignalled fence to a kernel driver, the driver can start the job on this dmabuf as soon as the fence is signalled. If you always wait on a fence in userspace, you have to wait for the userspace process to be scheduled; then userspace will set up the drm atomic request or similar action, which may take some time and may require another process in the kernel to be scheduled. This effectively adds some variable delay, a gap where nothing is happening between two operations. This time is lost and contributes to the overall operation latency.
The benefit of fences we are looking for is being able to set up the operations on various compatible drivers before the fence is signalled. This way, at the time-critical moment, a driver can be fed more jobs with no userspace roundtrip involved. It is also proposed to use it to return buffers into the v4l2 queue when they are freed, which can in some conditions prevent, say, a capture driver from skipping frames due to random scheduling delays.
With a display server, the client can send frames to the display server ahead of time, only the display server needs to wait for fences to signal before submitting frames to the kernel.
Of course, as Michel said, we could just always wait on the fence and then schedule. But if you do that, why would you bother implementing fences in v4l2 to start with? DQBuf does just that already.
A fence is more likely to work out of the box with non-V4L-related code than DQBuf?
If you use DQBuf, you are guaranteed that the data has been produced. A fence is not useful on a buffer that already contains the data you would be waiting for. That's why the fence is provided in the RFC at QBuf, basically when the free buffer is given to the v4l2 driver. QBuf can also be passed a fence in the RFC, so if the buffer is not yet free, the driver would wait on the fence before using it.
On 2019-04-24 7:43 p.m., Nicolas Dufresne wrote:
Le mercredi 24 avril 2019 à 18:54 +0200, Michel Dänzer a écrit :
On 2019-04-24 5:44 p.m., Nicolas Dufresne wrote:
Le mercredi 24 avril 2019 à 17:06 +0200, Daniel Vetter a écrit :
On Wed, Apr 24, 2019 at 4:41 PM Paul Kocialkowski paul.kocialkowski@bootlin.com wrote:
On Wed, 2019-04-24 at 16:39 +0200, Michel Dänzer wrote:
On 2019-04-24 2:01 p.m., Nicolas Dufresne wrote:
Rendering a video stream is more complex than what you describe here. Whenever there is an unexpected delay (late delivery of a frame, for example), you may end up in a situation where one frame is ready after the targeted vblank. If there is another frame that targets the following vblank and gets ready on time, the previous frame should be replaced by the most recent one.
With fences, what happens is that even if you received the next frame on time, naively replacing it is not possible, because we don't know when the fence for the next frame will be signalled. If you simply always replace the current frame, you may end up skipping many more vblanks than you expect, and that results in jumpy playback.
So you want to be able to replace a queued flip with another one then. That doesn't necessarily require allowing more than one flip to be queued ahead of time.
There might be other ways to do it, but this one has plenty of advantages.
The point of kms (well one of the reasons) was to separate the implementation of modesetting for specific hw from policy decisions like which frames to drop and how to schedule them. Kernel gives tools, userspace implements the actual protocols.
There's definitely a bit of a gap around scheduling flips for a specific frame, or allowing an already scheduled flip to be cancelled/overwritten, but no one has yet come up with a clear proposal for new uapi + example implementation + userspace implementation + big enough support from other compositors that this is what they want too.
Actually, the ATOMIC_AMEND patches propose a way to replace a scheduled flip?
Note that this can also be done in userspace with explicit fencing (by only selecting a frame and submitting it to the kernel after all corresponding fences have signalled), at least to some degree, but the kernel should be able to do it up to a later point in time and more reliably, with less risk of missing a flip for a frame which becomes ready just in time.
Indeed, but it would be great if we could do that with implicit fencing as well.
1. extract implicit fences from dma-buf. This part is just an idea, but easy to implement once we have someone who actually wants this. All we need is a new ioctl on the dma-buf to export the fences from the reservation_object as a sync_file (either the exclusive or the shared ones, selected with a flag).
2. do the exact same frame scheduling as with explicit fencing.
3. supply explicit fences in your atomic ioctl calls - these should overrule any implicit fences (assuming correct kernel drivers, but we have helpers so you can assume they all work correctly).
By design this is possible, it's just that no one yet bothered enough to make it happen. -Daniel
I'm not sure I understand the workflow of this one. I'm all in favour of leaving the hard work to userspace. Note that I have assumed explicit fences from the start; I don't think implicit fences will ever exist in v4l2, but I might be wrong. What I understood is that there was a previous attempt in the past, but it raised more issues than it actually solved. So, that being said, how do you handle exactly the following use cases:
- A frame was lost by the capture driver, but it was scheduled as the next buffer to render (normally the previous frame should remain).
Userspace just doesn't call into the kernel to flip to the lost frame, so the previous one remains.
We are stuck in a loop, you and me. Considering v4l2 to drm, where fences don't exist in v4l2, it makes very little sense to bring up fences if we are to wait on the fence in userspace.
It makes sense insofar as no V4L specific code would be needed to make sure that the contents of a buffer produced via V4L aren't consumed before they're ready to be.
- The scheduled frame is late for the next vblank (didn't signal on time); a new one may be better for the next vblank, but we will only know when its fence is signalled.
Userspace only selects a frame and submits it to the kernel after all its fences have signalled.
Better in this context means that the presentation time of this frame is closer to the next vblank time. Keep in mind that the idea is to schedule the frames before they are signalled, in order to make the usage of the fence useful in lowering the latency.
Fences are about signalling completion, not about low latency.
It can be used to remove a roundtrip with userspace at a very time-sensitive moment. If you pass a dmabuf with its unsignalled fence to a kernel driver, the driver can start the job on this dmabuf as soon as the fence is signalled. If you always wait on a fence in userspace, you have to wait for the userspace process to be scheduled,
I doubt this magically works without something like that (e.g. a workqueue, which runs in normal process context) in the kernel either. :)
then userspace will set up the drm atomic request or similar action, which may take some time and may require another process in the kernel to be scheduled. This effectively adds some variable delay, a gap where nothing is happening between two operations. This time is lost and contributes to the overall operation latency.
It only increases latency if it causes a flip to miss its target vblank, and it's not possible to know whether that happens at an unacceptable rate without trying. The prudent approach is to at least prototype a solution with as much complexity as possible in userspace first. If that turns out to perform too badly, then we can think about how to improve it by adding complexity in the kernel.
The benefit of fences we are looking for is being able to set up the operations on various compatible drivers before the fence is signalled. This way, at the time-critical moment, a driver can be fed more jobs with no userspace roundtrip involved.
That is possible with other operations, just not with page flipping yet.
Hi,
On Wed, 2019-04-24 at 10:31 +0200, Michel Dänzer wrote:
On 2019-04-19 10:38 a.m., Paul Kocialkowski wrote:
On Thu, 2019-04-18 at 20:30 -0400, Nicolas Dufresne wrote:
Le jeudi 18 avril 2019 à 10:18 +0200, Daniel Vetter a écrit :
It would be cool if both could be used concurrently and not just return -EBUSY when the device is used with the other subsystem.
We live in this world already :-) I think there's even patches (or merged already) to add fences to v4l, for Android.
This work is currently suspended. It will require some features on the DRM display side to really make this useful, but there are also a lot of challenges in V4L2. In GFX space, most of the use cases are about rendering as soon as possible. Though, in multimedia we have two problems: we need to synchronize the frame rendering with the audio, and output buffers may come out of order due to how video CODECs are made.
Definitely, it feels like the DRM display side is currently a good fit for render use cases, but not so much for precise display cases where we want to try and display a buffer at a given vblank target instead of "as soon as possible".
I have a userspace project where I've implemented a page flip queue, which only schedules the next flip when relevant and keeps ready buffers in the queue until then. This requires explicit vblank synchronisation (which DRM offers, but which pretty much all other, higher-level display APIs don't, so I'm just using a refresh-rate timer for them) and flip-done notification.
I haven't looked too much at how to flip with a target vblank with DRM directly but maybe the atomic API already has the bits in for that (but I haven't heard of such a thing as a buffer queue, so that makes me doubt it).
Not directly. What's available is that if userspace waits for vblank n and then submits a flip, the flip will complete in vblank n+1 (or a later vblank, depending on when the flip is submitted and when the fences the flip depends on signal).
There is reluctance to allow more than one flip to be queued in the kernel, as it would considerably increase complexity in the kernel. It would probably only be considered if there was a compelling use-case which was outright impossible otherwise.
Well, I think it's just less boilerplate for userspace. This is indeed quite complex, and I'd prefer to see that complexity done once and well in Linux rather than duplicated in userspace with more or less reliable implementations.
Well, I need to handle stuff like SDL in my userspace project, so I have to have all that queuing stuff in software anyway, but it would be good if each project didn't have to implement that. Worst case, it could be in libdrm too.
Usually, this kind of queuing will be handled in a display server such as Xorg or a Wayland compositor, not by the application such as a video player itself, or any library in the latter's address space. I'm not sure there's much potential for sharing code between display servers for this.
This assumes that you are using a display server, which is definitely not always the case (there is e.g. Kodi GBM). Well, I'm not saying it is essential to have it in the kernel, but it would avoid code duplication and lower the complexity in userspace.
In the first, we'd need a mechanism where we can schedule a render at a specific time or vblank. We can of course already implement this in software, but with fences, the scheduling would need to be done in the driver. Then if the fence is signalled earlier, the driver should hold on until the delay is met. If the fence got signalled late, we also need to think of a workflow. As we can't schedule more than one render in DRM at one time, I don't really see yet how to make that work.
Indeed, that's also one of the main issues I've spotted. Before using an implicit fence, we basically have to make sure the frame is due for display at the next vblank. Otherwise, we need to refrain from using the fence and schedule the flip later, which is kind of counter-productive.
Fences are about signalling that the contents of a frame are "done" and ready to be presented. They're not about specifying which frame is to be presented when.
Yes, that's precisely the issue I see with them. Once you have scheduled the flip with a buffer, it is too late to schedule a more recent buffer for the flip if one becomes available sooner (see the issue that Nicolas is describing). If you attach a vblank target to the flip, the flip can be skipped when the fence is signalled if a more recent buffer was signalled first.
I feel like specifying a target vblank would be a good unit for that,
The mechanism described above works for that.
I still don't see any fence-based mechanism that can work to achieve that, but maybe I'm missing your point.
since it's our native granularity after all (while a timestamp is not).
Note that variable refresh rate (Adaptive Sync / FreeSync / G-Sync) changes things in this regard. It makes the vblank length variable, and if you wait for multiple vblanks between flips, you get the maximum vblank length corresponding to the minimum refresh rate / timing granularity. Thus, it would be useful to allow userspace to specify a timestamp corresponding to the earliest time when the flip is to complete. The kernel could then try to hit that as closely as possible.
I'm not very familiar with how this works, but I don't really see what it changes. Does it mean we can flip multiple times per vblank? If so, how can userspace be aware of that and deal with it properly? Unless I'm missing something, I think flip scheduling should still work on vblank granularity in that case.
And I really like a vblank count over a timestamp, as one is the native unit at hand and the other one only correlates to it.
Cheers,
Paul
On 2019-04-24 2:19 p.m., Paul Kocialkowski wrote:
On Wed, 2019-04-24 at 10:31 +0200, Michel Dänzer wrote:
On 2019-04-19 10:38 a.m., Paul Kocialkowski wrote:
On Thu, 2019-04-18 at 20:30 -0400, Nicolas Dufresne wrote:
Le jeudi 18 avril 2019 à 10:18 +0200, Daniel Vetter a écrit :
It would be cool if both could be used concurrently and not just return -EBUSY when the device is used with the other subsystem.
We live in this world already :-) I think there's even patches (or merged already) to add fences to v4l, for Android.
This work is currently suspended. It will require some feature on the DRM display side to really make this useful, but there are also a lot of challenges in V4L2. In GFX space, most of the use cases are about rendering as soon as possible. In multimedia, though, we have two problems: we need to synchronize the frame rendering with the audio, and output buffers may come out of order due to how video CODECs are made.
Definitely, it feels like the DRM display side is currently a good fit for render use cases, but not so much for precise display cases where we want to try and display a buffer at a given vblank target instead of "as soon as possible".
I have a userspace project where I've implemented a page flip queue, which only schedules the next flip when relevant and keeps ready buffers in the queue until then. This requires explicit vblank synchronisation (which DRM offers, but pretty much all other, higher-level display APIs don't, so I'm just using a refresh-rate timer for them) and flip-done notification.
I haven't looked too much at how to flip with a target vblank with DRM directly but maybe the atomic API already has the bits in for that (but I haven't heard of such a thing as a buffer queue, so that makes me doubt it).
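The bookkeeping behind such a userspace flip queue can be sketched roughly as follows (hypothetical types and sizes, single plane, one flip in flight at a time; the actual flip submission would go through something like drmModePageFlip() with DRM_MODE_PAGE_FLIP_EVENT):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flip-queue state: buffers become "ready" (their fences
 * signal) in any order; at each flip opportunity we present the most
 * recently queued ready buffer and drop older, stale ones. */
struct flip_queue {
    uint32_t pending[8];   /* fb ids, oldest first */
    bool     ready[8];
    int      count;
    bool     flip_in_flight;
};

static void queue_push(struct flip_queue *q, uint32_t fb_id)
{
    if (q->count == 8)
        return; /* queue full: a real implementation would throttle */
    q->pending[q->count] = fb_id;
    q->ready[q->count] = false;
    q->count++;
}

/* Called when a buffer's fence signals. */
static void queue_mark_ready(struct flip_queue *q, uint32_t fb_id)
{
    for (int i = 0; i < q->count; i++)
        if (q->pending[i] == fb_id)
            q->ready[i] = true;
}

/* Called from the vblank/flip-done handler: pick the newest ready buffer,
 * discard everything older, and return it (0 if nothing to flip). */
static uint32_t queue_pick(struct flip_queue *q)
{
    int newest = -1;

    if (q->flip_in_flight)
        return 0;

    for (int i = 0; i < q->count; i++)
        if (q->ready[i])
            newest = i;
    if (newest < 0)
        return 0;

    uint32_t fb = q->pending[newest];
    /* Drop the chosen buffer and all older ones from the queue. */
    int remaining = q->count - newest - 1;
    for (int i = 0; i < remaining; i++) {
        q->pending[i] = q->pending[newest + 1 + i];
        q->ready[i] = q->ready[newest + 1 + i];
    }
    q->count = remaining;
    q->flip_in_flight = true;
    return fb;
}
```

This also illustrates the stale-buffer case discussed above: if a newer buffer's fence signals before an older one's, the older buffer is simply never presented.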
Not directly. What's available is that if userspace waits for vblank n and then submits a flip, the flip will complete in vblank n+1 (or a later vblank, depending on when the flip is submitted and when the fences the flip depends on signal).
There is reluctance allowing more than one flip to be queued in the kernel, as it would considerably increase complexity in the kernel. It would probably only be considered if there was a compelling use-case which was outright impossible otherwise.
Well, I think it's just less boilerplate for userspace. This is indeed quite complex, and I prefer to see that complexity done once and well in Linux rather than duplicated in userspace with more or less reliable implementations.
That's not the only trade-off to consider, e.g. I suspect handling this in the kernel is more complex than in userspace.
Well, I need to handle stuff like SDL in my userspace project, so I have to have all that queuing stuff in software anyway, but it would be good if each project didn't have to implement that. Worst case, it could be in libdrm too.
Usually, this kind of queuing will be handled in a display server such as Xorg or a Wayland compositor, not by the application such as a video player itself, or any library in the latter's address space. I'm not sure there's much potential for sharing code between display servers for this.
This assumes that you are using a display server, which is definitely not always the case (there is e.g. Kodi GBM). Well, I'm not saying it is essential to have it in the kernel, but it would avoid code duplication and lower the complexity in userspace.
For code duplication, my suggestion would be to use a display server instead of duplicating its functionality.
In the first, we'd need a mechanism where we can schedule a render at a specific time or vblank. We can of course already implement this in software, but with fences, the scheduling would need to be done in the driver. Then if the fence is signalled earlier, the driver should hold on until the delay is met. If the fence got signalled late, we also need to think of a workflow. As we can't schedule more than one render in DRM at one time, I don't really see yet how to make that work.
Indeed, that's also one of the main issues I've spotted. Before using an implicit fence, we basically have to make sure the frame is due for display at the next vblank. Otherwise, we need to refrain from using the fence and schedule the flip later, which is kind of counterproductive.
[...]
I feel like specifying a target vblank would be a good unit for that,
The mechanism described above works for that.
I still don't see any fence-based mechanism that can work to achieve that, but maybe I'm missing your point.
It's not fence based, just good old waiting for the previous vblank before submitting the flip to the kernel.
since it's our native granularity after all (while a timestamp is not).
Note that variable refresh rate (Adaptive Sync / FreeSync / G-Sync) changes things in this regard. It makes the vblank length variable, and if you wait for multiple vblanks between flips, you get the maximum vblank length corresponding to the minimum refresh rate / timing granularity. Thus, it would be useful to allow userspace to specify a timestamp corresponding to the earliest time when the flip is to complete. The kernel could then try to hit that as closely as possible.
I'm not very familiar with how this works, but I don't really see what it changes. Does it mean we can flip multiple times per vblank?
It's not about that.
And I really like a vblank count over a timestamp, as one is the native unit at hand and the other one only correlates to it.
From a video playback application POV it's really the other way around, isn't it? The target time is known (e.g. in order to sync up with audio), and the vblank count has to be calculated from that. And with variable refresh rate, this calculation can't be done reliably, because it's not known ahead of time when the next vblank starts (at least not more accurately than an interval corresponding to the maximum/minimum refresh rates).
If the target timestamp could be specified explicitly, the kernel could do the conversion to the vblank count for fixed refresh, and could adjust the refresh rate to hit the target more accurately with variable refresh.
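For the fixed-refresh case, that conversion is a simple round-up division (a sketch with a hypothetical helper name; a real implementation would take the reference timestamp and period from the CRTC's vblank counter and mode timings):

```c
#include <stdint.h>

/* Given the timestamp of a reference vblank and the fixed refresh period,
 * return the sequence number of the earliest vblank that completes at or
 * after the target timestamp. Times in nanoseconds. Hypothetical helper,
 * for illustration only. */
static uint64_t vblank_for_target(uint64_t ref_vblank_ns, uint64_t ref_seq,
                                  uint64_t period_ns, uint64_t target_ns)
{
    if (target_ns <= ref_vblank_ns)
        return ref_seq;
    /* Round up: a vblank strictly before the target is too early. */
    uint64_t delta = target_ns - ref_vblank_ns;
    return ref_seq + (delta + period_ns - 1) / period_ns;
}
```

For example, with a 60 Hz mode (period of about 16.67 ms), a target 20 ms after the reference vblank maps to two vblanks later. With variable refresh, there is no fixed period to divide by, which is where adjusting the timing to hit the target comes in.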
On Wed, 17 Apr 2019 20:10:15 +0200 Paul Kocialkowski <paul.kocialkowski@bootlin.com> wrote:
There's also the possibility of writing up a drm-render DDX to handle these 2D blitters that can make things a lot faster when running a desktop environment. As for Wayland, well, I don't really know what to think. I was under the impression that it relies on GL for 2D operations, but am really not sure how true that actually is.
Hi Paul,
Wayland does not rely on anything really, it does not even have any rendering commands, and is completely agnostic to how applications or display servers might be drawing things. Wayland (protocol) does care about buffer types and fences though, since those are the things passed between applications and servers.
In a Wayland architecture, each display server (called a Wayland compositor, corresponding to Xorg + window manager + compositing manager) uses whatever they want to use for putting the screen contents together. OpenGL is a popular choice, yes, but they may also use Vulkan, Pixman, Cairo, Skia, DRM KMS planes, and whatnot or a mix of any. Sometimes it may so happen that the display server does not need to render at all, the display hardware can realize the screen contents through e.g. KMS planes.
Writing a hardware specific driver (like a DDX for Xorg) for one display server (or a display server library like wlroots or libweston) is no longer reasonable. You would have to do it on so many display server projects. What really makes it infeasible is the hardware-specific aspect. People would have to write a driver for every display server project for every hardware model. That's just not feasible today.
Some display server projects even refuse to take hardware-specific code upstream, because keeping it working has a high cost and only very few people can test it.
The only way, as I see it, that you could have Wayland compositors at large take advantage of 2D hardware units is to come up with a common userspace API, in a sense similar to Vulkan or OpenGL, so that each display server would only need to support that API, and the API implementation would handle the hardware-specific parts. OpenWF by Khronos may have been the most serious effort in that direction; good luck finding any users or implementations today. Although maybe Android's hwcomposer could be the next one.
However, if someone is doing a special Wayland compositor to be used on specific hardware, they can of course use whatever to put the screen contents together in a downstream fork. Wayland does not restrict that in any way, not even by buffer or fence types because you can extend Wayland to deal with anything you need, as long as you also modify the apps or toolkits to do it too. The limitations are really more political and practical if you aim for upstream and wide-spread use of 2D hardware blocks.
Thanks, pq
Hi Pekka,
Le lundi 06 mai 2019 à 11:28 +0300, Pekka Paalanen a écrit :
On Wed, 17 Apr 2019 20:10:15 +0200 Paul Kocialkowski <paul.kocialkowski@bootlin.com> wrote:
There's also the possibility of writing up a drm-render DDX to handle these 2D blitters that can make things a lot faster when running a desktop environment. As for Wayland, well, I don't really know what to think. I was under the impression that it relies on GL for 2D operations, but am really not sure how true that actually is.
Hi Paul,
Wayland does not rely on anything really, it does not even have any rendering commands, and is completely agnostic to how applications or display servers might be drawing things. Wayland (protocol) does care about buffer types and fences though, since those are the things passed between applications and servers.
In a Wayland architecture, each display server (called a Wayland compositor, corresponding to Xorg + window manager + compositing manager) uses whatever they want to use for putting the screen contents together. OpenGL is a popular choice, yes, but they may also use Vulkan, Pixman, Cairo, Skia, DRM KMS planes, and whatnot or a mix of any. Sometimes it may so happen that the display server does not need to render at all, the display hardware can realize the screen contents through e.g. KMS planes.
Right, I looked some more at Wayland and had some discussions over IRC (come to think of it, I'm pretty sure you were in the discussions too) to get a clearer understanding of the architecture. The fact that the Wayland protocol is render-agnostic and does not allocate buffers on its own feels very sane to me.
Writing a hardware specific driver (like a DDX for Xorg) for one display server (or a display server library like wlroots or libweston) is no longer reasonable. You would have to do it on so many display server projects. What really makes it infeasible is the hardware-specific aspect. People would have to write a driver for every display server project for every hardware model. That's just not feasible today.
Yes, this is why I am suggesting implementing a DRM helper library for that, which would handle the common drivers. Basically what mesa does for 3D, but with a DRM-specific-but-device-agnostic userspace interface. So the overhead for integration in display servers would be minimal.
Some display server projects even refuse to take hardware-specific code upstream, because keeping it working has a high cost and only very few people can test it.
Right, maintenance aspects are quite important and I think it's definitely best to centralize per-device support in a common library.
The only way as I see that you could have Wayland compositors at large take advantage of 2D hardware units is to come up with the common userspace API in the sense similar to Vulkan or OpenGL, so that each display server would only need to support the API, and the API implementation would handle the hardware-specific parts. OpenWF by Khronos may have been the most serious effort in that, good luck finding any users or implementations today. Although maybe Android's hwcomposer could be the next one.
I would be very cautious regarding the approach of designing a "standardized" API across systems. Most of the time, this does not work well and ends up involving a glue layer of crap that is not always a good fit for the system. Things more or less worked out with GL (with significant effort put into it), but there are countless other examples where it didn't (things like OpenMAX, OpenVG, etc).
In addition, this would mostly only be used in compositors, not in final applications, so the need to have a common API across systems is much reduced. There's also the fact that 2D is much less complicated than 3D.
So I am not very interested in this form of standardization and I think a DRM-specific userspace API for this is not only sufficient, but probably also the best fit for the job. Maybe the library implementing this API and device support could later be extended to support a standardized API across systems too if one shows up (a bit like mesa supports different state trackers). That's definitely not a personal priority though and I firmly believe it should not be a blocker to get 2D blitters support with DRM.
However, if someone is doing a special Wayland compositor to be used on specific hardware, they can of course use whatever to put the screen contents together in a downstream fork. Wayland does not restrict that in any way, not even by buffer or fence types because you can extend Wayland to deal with anything you need, as long as you also modify the apps or toolkits to do it too. The limitations are really more political and practical if you aim for upstream and wide-spread use of 2D hardware blocks.
Yes I understand that the issue is not so much on the technical side, but rather on governance and politics.
Cheers,
Paul