Hi,
there is some hardware that can do 2D compositing with an arbitrary number of planes. I'm not sure what the absolute maximum number of planes is, but for the discussion, let's say it is 100.
There are many complicated, dynamic constraints on how many planes can be used at once, what sizes they can be, and so on. A driver would be able to check those before kicking the 2D compositing engine.
The 2D compositing engine in the best case (only a few planes used) is able to composite on the fly in scanout, just like the usual overlay hardware blocks in CRTCs. When the composition complexity goes up, the driver can fall back to compositing into a buffer rather than on the fly in scanout. This fallback needs to be completely transparent to user space, implying only additional latency, if anything.
These 2D compositing features should be exposed to user space through a standard kernel ABI, hopefully an existing ABI, or one available in the very near future, like the KMS atomic.
Assuming the DRM universal planes and atomic mode setting / page flip infrastructure is in place, could the 2D compositing capabilities be exposed through universal planes? We can assume that plane properties are enough to describe all the compositing parameters.
Atomic updates are needed so that the complicated constraints can be checked, and user space can try to reduce the composition complexity if the kernel driver sees that it won't work.
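As a rough sketch of what that check could look like from user space (assuming the atomic libdrm entry points drmModeAtomicAlloc/AddProperty/Commit; the struct and the property IDs here are placeholders obtained during discovery, not anything this hardware defines):

#include <stdint.h>
#include <xf86drm.h>
#include <xf86drmMode.h>

/* One view to put on one KMS plane; all IDs come from prior discovery. */
struct view {
    uint32_t plane_id;
    uint32_t crtc_id;
    uint32_t fb_id;
    uint32_t prop_fb_id;    /* "FB_ID" property ID on this plane */
    uint32_t prop_crtc_id;  /* "CRTC_ID" property ID on this plane */
};

/* Ask the driver whether the first n views fit its constraints, without
 * touching the hardware. Returns 0 when the configuration is accepted. */
static int test_planes(int fd, const struct view *v, int n)
{
    drmModeAtomicReq *req = drmModeAtomicAlloc();
    int ret;

    if (!req)
        return -1;

    for (int i = 0; i < n; i++) {
        drmModeAtomicAddProperty(req, v[i].plane_id,
                                 v[i].prop_crtc_id, v[i].crtc_id);
        drmModeAtomicAddProperty(req, v[i].plane_id,
                                 v[i].prop_fb_id, v[i].fb_id);
        /* SRC_ and CRTC_ coordinates, z-order, blending, etc. go here too. */
    }

    ret = drmModeAtomicCommit(fd, req, DRM_MODE_ATOMIC_TEST_ONLY, NULL);
    drmModeAtomicFree(req);
    return ret;
}

A compositor could call this with decreasing n (compositing the remaining views itself) until the driver accepts, then repeat the same request without DRM_MODE_ATOMIC_TEST_ONLY to actually flip.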
Would it be feasible to generate a hundred identical non-primary planes to be exposed to user space via DRM?
If that could be done, the kernel driver could just use the existing kernel/user ABIs without having to invent something new, and programs like a Wayland compositor would not need to be coded specifically for this hardware.
What problems do you see with this plan? Are any of those problems unfixable or simply prohibitive?
I have some concerns, which I am not sure will actually be a problem:
- Does allocating 100 planes eat too much kernel memory? I mean just the bookkeeping, properties, etc.
- Would such an amount of planes make some in-kernel algorithms slow (particularly in DRM common code)?
- Considering how user space discovers all DRM resources, would this make a compositor "slow" to start?
I suppose whether these turn out to be prohibitive or not, one just has to implement it and see. It should be usable on a slowish CPU with unimpressive amounts of RAM, because that is where a separate 2D compositing engine gives the most kick.
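For the discovery concern specifically, the user space side would be the usual universal-planes enumeration, so the startup cost is roughly one ioctl for the plane list plus one per plane (more with properties). A minimal sketch with stock libdrm calls:

#include <stdio.h>
#include <stdint.h>
#include <xf86drm.h>
#include <xf86drmMode.h>

/* Enumerate every plane the driver exposes; with ~100 planes this loop is
 * where the extra startup work would show up. */
static void dump_planes(int fd)
{
    drmModePlaneRes *res;

    drmSetClientCap(fd, DRM_CLIENT_CAP_UNIVERSAL_PLANES, 1);

    res = drmModeGetPlaneResources(fd);
    if (!res)
        return;

    for (uint32_t i = 0; i < res->count_planes; i++) {
        drmModePlane *p = drmModeGetPlane(fd, res->planes[i]);

        if (!p)
            continue;
        printf("plane %u: %u formats, possible crtcs 0x%x\n",
               p->plane_id, p->count_formats, p->possible_crtcs);
        drmModeFreePlane(p);
    }
    drmModeFreePlaneResources(res);
}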
FWIW, dynamically created/destroyed planes would probably not be the answer. The kernel driver cannot decide before-hand how many planes it can expose. How many planes can be used depends completely on how user space decides to use them. Therefore I believe it should expose the maximum number always, whether there is any real use case that could actually get them all running or not.
What if I cannot even pick a maximum number of planes, but wanted to (as the hardware allows) let the 2D compositing scale up basically unlimited while becoming just slower and slower?
I think at that point one would be looking at a rendering API really, rather than a KMS API, so it's probably out of scope. Where is the line between KMS 2D compositing with planes vs. 2D composite rendering?
Should I really be designing a driver-specific compositing API instead, similar to what the Mesa OpenGL implementations use? Then have user space maybe use the user space driver part via OpenWFC, perhaps? And when I mention OpenWFC, you probably notice that I am not aware of any standard user space API I could be implementing here. ;-)
Thanks, pq
On Mon, Aug 11, 2014 at 01:38:55PM +0300, Pekka Paalanen wrote:
Hi,
Hi,
there is some hardware that can do 2D compositing with an arbitrary number of planes. I'm not sure what the absolute maximum number of planes is, but for the discussion, let's say it is 100.
There are many complicated, dynamic constraints on how many planes can be used at once, what sizes they can be, and so on. A driver would be able to check those before kicking the 2D compositing engine.
The 2D compositing engine in the best case (only a few planes used) is able to composite on the fly in scanout, just like the usual overlay hardware blocks in CRTCs. When the composition complexity goes up, the driver can fall back to compositing into a buffer rather than on the fly in scanout. This fallback needs to be completely transparent to user space, implying only additional latency, if anything.
This looks like a fallback that would use GL to compose the intermediate buffer. Any reason why that fallback can't be kicked from userspace?
On Mon, 11 Aug 2014 11:57:10 +0100 Damien Lespiau damien.lespiau@intel.com wrote:
On Mon, Aug 11, 2014 at 01:38:55PM +0300, Pekka Paalanen wrote:
Hi,
Hi,
there is some hardware that can do 2D compositing with an arbitrary number of planes. I'm not sure what the absolute maximum number of planes is, but for the discussion, let's say it is 100.
There are many complicated, dynamic constraints on how many planes can be used at once, what sizes they can be, and so on. A driver would be able to check those before kicking the 2D compositing engine.
The 2D compositing engine in the best case (only a few planes used) is able to composite on the fly in scanout, just like the usual overlay hardware blocks in CRTCs. When the composition complexity goes up, the driver can fall back to compositing into a buffer rather than on the fly in scanout. This fallback needs to be completely transparent to user space, implying only additional latency, if anything.
This looks like a fallback that would use GL to compose the intermediate buffer. Any reason why that fallback can't be kicked from userspace?
It is not GL, and GL might not be available or desirable. It is still the same 2D compositing engine in hardware, but now running with an off-screen target buffer, because it can no longer keep up with the continuous pixel rate that direct scanout would need.
If we were to use the 2D compositing engine from user space, we would be on the road to OpenWFC. IOW, there is no standard API for the user space to use yet, as far as I'm aware. ;-)
I'm just trying to avoid having to design a kernel driver ABI for a user space driver, then design/implement some standard user space API on top, and then go fix all compositors to actually use it instead of / with KMS.
Thanks, pq
On Mon, Aug 11, 2014 at 03:07:33PM +0300, Pekka Paalanen wrote:
there is some hardware that can do 2D compositing with an arbitrary number of planes. I'm not sure what the absolute maximum number of planes is, but for the discussion, let's say it is 100.
There are many complicated, dynamic constraints on how many planes can be used at once, what sizes they can be, and so on. A driver would be able to check those before kicking the 2D compositing engine.
The 2D compositing engine in the best case (only a few planes used) is able to composite on the fly in scanout, just like the usual overlay hardware blocks in CRTCs. When the composition complexity goes up, the driver can fall back to compositing into a buffer rather than on the fly in scanout. This fallback needs to be completely transparent to user space, implying only additional latency, if anything.
This looks like a fallback that would use GL to compose the intermediate buffer. Any reason why that fallback can't be kicked from userspace?
It is not GL, and GL might not be available or desirable. It is still the same 2D compositing engine in hardware, but now running with an off-screen target buffer, because it can no longer keep up with the continuous pixel rate that direct scanout would need.
I didn't mean this was GL, but just making the parallel, ie. we wouldn't put a GL fallback into the kernel.
If we were to use the 2D compositing engine from user space, we would be on the road to OpenWFC. IOW, there is no standard API for the user space to use yet, as far as I'm aware. ;-)
I'm just trying to avoid having to design a kernel driver ABI for a user space driver, then design/implement some standard user space API on top, and then go fix all compositors to actually use it instead of / with KMS.
It's no easy trade-off. For instance, if the compositor doesn't know about some of the hw constraints you are talking about, it may ask the kernel for a configuration that suddenly will only allow 20 fps updates (because of the bw limitation you're mentioning). And the compositor just wouldn't know.
I can only speak for the hw I know. If you want to squeeze everything you can from that simple (compared to the one you're talking about) display hw, there's no choice: the compositor needs to know about the constraints to make clever decisions (that's what we do on Android). But then the appeal of a common interface is understandable.
(An answer that doesn't actually say anything interesting, oh well),
On Mon, 11 Aug 2014 14:14:56 +0100 Damien Lespiau damien.lespiau@intel.com wrote:
On Mon, Aug 11, 2014 at 03:07:33PM +0300, Pekka Paalanen wrote:
there is some hardware that can do 2D compositing with an arbitrary number of planes. I'm not sure what the absolute maximum number of planes is, but for the discussion, let's say it is 100.
There are many complicated, dynamic constraints on how many planes can be used at once, what sizes they can be, and so on. A driver would be able to check those before kicking the 2D compositing engine.
The 2D compositing engine in the best case (only a few planes used) is able to composite on the fly in scanout, just like the usual overlay hardware blocks in CRTCs. When the composition complexity goes up, the driver can fall back to compositing into a buffer rather than on the fly in scanout. This fallback needs to be completely transparent to user space, implying only additional latency, if anything.
This looks like a fallback that would use GL to compose the intermediate buffer. Any reason why that fallback can't be kicked from userspace?
It is not GL, and GL might not be available or desirable. It is still the same 2D compositing engine in hardware, but now running with an off-screen target buffer, because it can no longer keep up with the continuous pixel rate that direct scanout would need.
I didn't mean this was GL, but just making the parallel, ie. we wouldn't put a GL fallback into the kernel.
If we were to use the 2D compositing engine from user space, we would be on the road to OpenWFC. IOW, there is no standard API for the user space to use yet, as far as I'm aware. ;-)
I'm just trying to avoid having to design a kernel driver ABI for a user space driver, then design/implement some standard user space API on top, and then go fix all compositors to actually use it instead of / with KMS.
It's no easy trade-off. For instance, if the compositor doesn't know about some of the hw constraints you are talking about, it may ask the kernel for a configuration that suddenly will only allow 20 fps updates (because of the bw limitation you're mentioning). And the compositor just wouldn't know.
Sure, but it would still be much better than the actual fallback in the compositor in user space, if we cannot drive the 2D engine from user space.
KMS works the same way already: if you have GL rendering that just runs for too long, your final pageflip using it will implicitly get delayed that much. Does it not?
I can only speak for the hw I know. If you want to squeeze everything you can from that simple (compared to the one you're talking about) display hw, there's no choice: the compositor needs to know about the constraints to make clever decisions (that's what we do on Android). But then the appeal of a common interface is understandable.
(An answer that doesn't actually say anything interesting, oh well),
Yeah... so it comes down to deciding at what point the kernel driver will say "this won't fly, do something else". And danvet has a pretty solid answer to that, I think.
Thanks, pq
On Mon, Aug 11, 2014 at 01:38:55PM +0300, Pekka Paalanen wrote:
Hi,
there is some hardware that can do 2D compositing with an arbitrary number of planes. I'm not sure what the absolute maximum number of planes is, but for the discussion, let's say it is 100.
There are many complicated, dynamic constraints on how many planes can be used at once, what sizes they can be, and so on. A driver would be able to check those before kicking the 2D compositing engine.
The 2D compositing engine in the best case (only a few planes used) is able to composite on the fly in scanout, just like the usual overlay hardware blocks in CRTCs. When the composition complexity goes up, the driver can fall back to compositing into a buffer rather than on the fly in scanout. This fallback needs to be completely transparent to user space, implying only additional latency, if anything.
These 2D compositing features should be exposed to user space through a standard kernel ABI, hopefully an existing ABI, or one available in the very near future, like the KMS atomic.
I presume we're talking about the video core from raspi? Or at least something similar?
Assuming the DRM universal planes and atomic mode setting / page flip infrastructure is in place, could the 2D compositing capabilities be exposed through universal planes? We can assume that plane properties are enough to describe all the compositing parameters.
Atomic updates are needed so that the complicated constraints can be checked, and user space can try to reduce the composition complexity if the kernel driver sees that it won't work.
Would it be feasible to generate a hundred identical non-primary planes to be exposed to user space via DRM?
If that could be done, the kernel driver could just use the existing kernel/user ABIs without having to invent something new, and programs like a Wayland compositor would not need to be coded specifically for this hardware.
What problems do you see with this plan? Are any of those problems unfixable or simply prohibitive?
I have some concerns, which I am not sure will actually be a problem:
- Does allocating 100 planes eat too much kernel memory? I mean just the bookkeeping, properties, etc.
- Would such an amount of planes make some in-kernel algorithms slow (particularly in DRM common code)?
- Considering how user space discovers all DRM resources, would this make a compositor "slow" to start?
I don't see any problem with that. We have a few plane-loops, but iirc those can be easily fixed to use indices and similar stuff. The atomic ioctl itself should scale nicely.
I suppose whether these turn out to be prohibitive or not, one just has to implement it and see. It should be usable on a slowish CPU with unimpressive amounts of RAM, because that is where a separate 2D compositing engine gives the most kick.
FWIW, dynamically created/destroyed planes would probably not be the answer. The kernel driver cannot decide before-hand how many planes it can expose. How many planes can be used depends completely on how user space decides to use them. Therefore I believe it should expose the maximum number always, whether there is any real use case that could actually get them all running or not.
Yeah, dynamic planes don't sound like a nice solution, not least because you'll get to audit piles of code. Currently really only framebuffers (and to some extent connectors) can come and go freely in kms-land.
What if I cannot even pick a maximum number of planes, but wanted to (as the hardware allows) let the 2D compositing scale up basically unlimited while becoming just slower and slower?
I think at that point one would be looking at a rendering API really, rather than a KMS API, so it's probably out of scope. Where is the line between KMS 2D compositing with planes vs. 2D composite rendering?
I think kms should still be real-time compositing - if you have to internally render to a buffer and then scan that one out due to lack of memory bandwidth or so, that very much sounds like a rendering api. Ofc stuff like writeback buffers blurs that a bit. But hw writeback is still real-time.
Should I really be designing a driver-specific compositing API instead, similar to what the Mesa OpenGL implementations use? Then have user space maybe use the user space driver part via OpenWFC, perhaps? And when I mention OpenWFC, you probably notice that I am not aware of any standard user space API I could be implementing here. ;-)
Personally I'd expose a bunch of planes with kms (enough so that you can reap the usual benefits planes bring wrt video-playback and stuff like that). So perhaps something in line with what current hw does, doubled or so - 16 planes, say. Your driver would reject any requests that need intermediate buffers to store render results, i.e. everything that can't be scanned out directly in real-time at about 60fps. The fun with kms planes is also that right now we have 0 standards for z-ordering and blending, so we'd need to define that first.
Then expose everything else with a separate api. I guess you'll just end up with per-compositor userspace drivers due to the lack of a widespread 2d api. OpenVG is kinda dead, and cairo might not fit. -Daniel
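On the z-ordering/blending point: if stacking ever did get defined as KMS plane properties, user space would find and set them like any other property. The "zpos" name below is purely hypothetical (no such standard exists yet); everything else is stock libdrm:

#include <string.h>
#include <stdint.h>
#include <xf86drm.h>
#include <xf86drmMode.h>

/* Look up a plane property by name; returns 0 if the driver doesn't expose it.
 * "zpos" (and whatever a blend-mode property would be called) are made-up
 * names here, since no z-order/blending standard exists yet. */
static uint32_t plane_prop_id(int fd, uint32_t plane_id, const char *name)
{
    drmModeObjectProperties *props;
    uint32_t id = 0;

    props = drmModeObjectGetProperties(fd, plane_id, DRM_MODE_OBJECT_PLANE);
    if (!props)
        return 0;

    for (uint32_t i = 0; i < props->count_props && !id; i++) {
        drmModePropertyRes *prop = drmModeGetProperty(fd, props->props[i]);

        if (prop && strcmp(prop->name, name) == 0)
            id = prop->prop_id;
        drmModeFreeProperty(prop);
    }
    drmModeFreeObjectProperties(props);
    return id;
}

/* In an atomic request, stacking would then be just another value:
 *   drmModeAtomicAddProperty(req, plane_id,
 *                            plane_prop_id(fd, plane_id, "zpos"), 3);
 */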
Hi Daniel,
you make perfect sense as usual. :-) Comments below.
On Mon, 11 Aug 2014 14:06:36 +0200 Daniel Vetter daniel@ffwll.ch wrote:
On Mon, Aug 11, 2014 at 01:38:55PM +0300, Pekka Paalanen wrote:
Hi,
there is some hardware that can do 2D compositing with an arbitrary number of planes. I'm not sure what the absolute maximum number of planes is, but for the discussion, let's say it is 100.
There are many complicated, dynamic constraints on how many planes can be used at once, what sizes they can be, and so on. A driver would be able to check those before kicking the 2D compositing engine.
The 2D compositing engine in the best case (only a few planes used) is able to composite on the fly in scanout, just like the usual overlay hardware blocks in CRTCs. When the composition complexity goes up, the driver can fall back to compositing into a buffer rather than on the fly in scanout. This fallback needs to be completely transparent to user space, implying only additional latency, if anything.
These 2D compositing features should be exposed to user space through a standard kernel ABI, hopefully an existing ABI, or one available in the very near future, like the KMS atomic.
I presume we're talking about the video core from raspi? Or at least something similar?
Yes.
Assuming the DRM universal planes and atomic mode setting / page flip infrastructure is in place, could the 2D compositing capabilities be exposed through universal planes? We can assume that plane properties are enough to describe all the compositing parameters.
Atomic updates are needed so that the complicated constraints can be checked, and user space can try to reduce the composition complexity if the kernel driver sees that it won't work.
Would it be feasible to generate a hundred identical non-primary planes to be exposed to user space via DRM?
If that could be done, the kernel driver could just use the existing kernel/user ABIs without having to invent something new, and programs like a Wayland compositor would not need to be coded specifically for this hardware.
What problems do you see with this plan? Are any of those problems unfixable or simply prohibitive?
I have some concerns, which I am not sure will actually be a problem:
- Does allocating 100 planes eat too much kernel memory? I mean just the bookkeeping, properties, etc.
- Would such an amount of planes make some in-kernel algorithms slow (particularly in DRM common code)?
- Considering how user space discovers all DRM resources, would this make a compositor "slow" to start?
I don't see any problem with that. We have a few plane-loops, but iirc those can be easily fixed to use indices and similar stuff. The atomic ioctl itself should scale nicely.
Very nice.
I suppose whether these turn out to be prohibitive or not, one just has to implement it and see. It should be usable on a slowish CPU with unimpressive amounts of RAM, because that is where a separate 2D compositing engine gives the most kick.
FWIW, dynamically created/destroyed planes would probably not be the answer. The kernel driver cannot decide before-hand how many planes it can expose. How many planes can be used depends completely on how user space decides to use them. Therefore I believe it should expose the maximum number always, whether there is any real use case that could actually get them all running or not.
Yeah, dynamic planes don't sound like a nice solution, not least because you'll get to audit piles of code. Currently really only framebuffers (and to some extent connectors) can come and go freely in kms-land.
Yup, thought so.
What if I cannot even pick a maximum number of planes, but wanted to (as the hardware allows) let the 2D compositing scale up basically unlimited while becoming just slower and slower?
I think at that point one would be looking at a rendering API really, rather than a KMS API, so it's probably out of scope. Where is the line between KMS 2D compositing with planes vs. 2D composite rendering?
I think kms should still be real-time compositing - if you have to internally render to a buffer and then scan that one out due to lack of memory bandwidth or so, that very much sounds like a rendering api. Ofc stuff like writeback buffers blurs that a bit. But hw writeback is still real-time.
Agreed, that's a good and clear definition, even if it might make my life harder.
I'm still not completely sure that using an intermediate buffer means sacrificing real-time performance (i.e. being able to hit the next vblank user space is aiming for); maybe the 2D engine output rate fluctuates so that the scanout block would have problems, but a buffer could still be completed in time. Anyway, details.
Would using an intermediate buffer be ok if we can still maintain real-time? That is, say, if a compositor kicks the atomic update e.g. 7 ms before vblank, would we still hit it even with the intermediate buffer? Whether that is actually possible, I don't know yet.
Should I really be designing a driver-specific compositing API instead, similar to what the Mesa OpenGL implementations use? Then have user space maybe use the user space driver part via OpenWFC, perhaps? And when I mention OpenWFC, you probably notice that I am not aware of any standard user space API I could be implementing here. ;-)
Personally I'd expose a bunch of planes with kms (enough so that you can reap the usual benefits planes bring wrt video-playback and stuff like that). So perhaps something in line with what current hw does, doubled or so - 16 planes, say. Your driver would reject any requests that need intermediate buffers to store render results, i.e. everything that can't be scanned out directly in real-time at about 60fps. The fun with kms planes is also that right now we have 0 standards for z-ordering and blending, so we'd need to define that first.
I do not yet know where that real-time limit is, but I'm guessing it could be pretty low. If it is, we might start hitting software compositing (like Pixman) very often, which is too slow to be usable.
Defining z-order and blending sounds like peanuts compared to below.
Then expose everything else with a separate api. I guess you'll just end up with per-compositor userspace drivers due to the lack of a widespread 2d api. OpenVG is kinda dead, and cairo might not fit.
Yeah, that is kind of the worst case, which also seems unavoidable.
Thanks, pq
On Mon, Aug 11, 2014 at 03:47:22PM +0300, Pekka Paalanen wrote:
What if I cannot even pick a maximum number of planes, but wanted to (as the hardware allows) let the 2D compositing scale up basically unlimited while becoming just slower and slower?
I think at that point one would be looking at a rendering API really, rather than a KMS API, so it's probably out of scope. Where is the line between KMS 2D compositing with planes vs. 2D composite rendering?
I think kms should still be real-time compositing - if you have to internally render to a buffer and then scan that one out due to lack of memory bandwidth or so, that very much sounds like a rendering api. Ofc stuff like writeback buffers blurs that a bit. But hw writeback is still real-time.
Agreed, that's a good and clear definition, even if it might make my life harder.
I'm still not completely sure that using an intermediate buffer means sacrificing real-time performance (i.e. being able to hit the next vblank user space is aiming for); maybe the 2D engine output rate fluctuates so that the scanout block would have problems, but a buffer could still be completed in time. Anyway, details.
Would using an intermediate buffer be ok if we can still maintain real-time? That is, say, if a compositor kicks the atomic update e.g. 7 ms before vblank, would we still hit it even with the intermediate buffer? Whether that is actually possible, I don't know yet.
I guess you could hide this in the kernel if you want. After all the entire point of kms is to shovel the memory management into the kernel driver's responsibility. But I agree with Rob that if there are intermediate buffers, it would be fairly neat to let userspace know about them.
So I don't think the intermediate buffer thing would be a no-go for kms, but I suspect that will only happen when the videocore can't hit the next frame reliably. And that kind of stutter is imo not good for a kms driver. I guess you could forgo vblank timestamp support and just go with super-variable scanout times, but I guess that will make the video playback people unhappy - they already bitch about the sub 1% inaccuracy we have in our hdmi clocks.
Should I really be designing a driver-specific compositing API instead, similar to what the Mesa OpenGL implementations use? Then have user space maybe use the user space driver part via OpenWFC, perhaps? And when I mention OpenWFC, you probably notice that I am not aware of any standard user space API I could be implementing here. ;-)
Personally I'd expose a bunch of planes with kms (enough so that you can reap the usual benefits planes bring wrt video-playback and stuff like that). So perhaps something in line with what current hw does, doubled or so - 16 planes, say. Your driver would reject any requests that need intermediate buffers to store render results, i.e. everything that can't be scanned out directly in real-time at about 60fps. The fun with kms planes is also that right now we have 0 standards for z-ordering and blending, so we'd need to define that first.
I do not yet know where that real-time limit is, but I'm guessing it could be pretty low. If it is, we might start hitting software compositing (like Pixman) very often, which is too slow to be usable.
Well, for other drivers/stacks we'd fall back to GL compositing; pixman would obviously be terribly slow. Curious question: can you provoke the hw/firmware to render into arbitrary buffers, or does it only work together with real display outputs?
So I guess the real question is: what kind of interface does videocore provide? Note that kms framebuffers are super-flexible and you're free to add your own ioctl for special framebuffers which are rendered live by the vc. So that might be a possible way to expose this if you can't tell the vc which buffers to render into explicitly.
Defining z-order and blending sounds like peanuts compared to below.
Then expose everything else with a separate api. I guess you'll just end up with per-compositor userspace drivers due to the lack of a widespread 2d api. OpenVG is kinda dead, and cairo might not fit.
Yeah, that is kind of the worst case, which also seems unavoidable.
Yeah, there's no universal 2d accel standard at all. Which sucks for hw that can't do full gl. -Daniel
On Mon, Aug 11, 2014 at 05:35:31PM +0200, Daniel Vetter wrote:
On Mon, Aug 11, 2014 at 03:47:22PM +0300, Pekka Paalanen wrote:
What if I cannot even pick a maximum number of planes, but wanted to (as the hardware allows) let the 2D compositing scale up basically unlimited while becoming just slower and slower?
I think at that point one would be looking at a rendering API really, rather than a KMS API, so it's probably out of scope. Where is the line between KMS 2D compositing with planes vs. 2D composite rendering?
I think kms should still be real-time compositing - if you have to internally render to a buffer and then scan that one out due to lack of memory bandwidth or so, that very much sounds like a rendering api. Ofc stuff like writeback buffers blurs that a bit. But hw writeback is still real-time.
Agreed, that's a good and clear definition, even if it might make my life harder.
I'm still not completely sure that using an intermediate buffer means sacrificing real-time performance (i.e. being able to hit the next vblank user space is aiming for); maybe the 2D engine output rate fluctuates so that the scanout block would have problems, but a buffer could still be completed in time. Anyway, details.
Would using an intermediate buffer be ok if we can still maintain real-time? That is, say, if a compositor kicks the atomic update e.g. 7 ms before vblank, would we still hit it even with the intermediate buffer? Whether that is actually possible, I don't know yet.
I guess you could hide this in the kernel if you want. After all the entire point of kms is to shovel the memory management into the kernel driver's responsibility. But I agree with Rob that if there are intermediate buffers, it would be fairly neat to let userspace know about them.
So I don't think the intermediate buffer thing would be a no-go for kms, but I suspect that will only happen when the videocore can't hit the next frame reliably. And that kind of stutter is imo not good for a kms driver. I guess you could forgo vblank timestamp support and just go with super-variable scanout times, but I guess that will make the video playback people unhappy - they already bitch about the sub 1% inaccuracy we have in our hdmi clocks.
Should I really be designing a driver-specific compositing API instead, similar to what the Mesa OpenGL implementations use? Then have user space maybe use the user space driver part via OpenWFC, perhaps? And when I mention OpenWFC, you probably notice that I am not aware of any standard user space API I could be implementing here. ;-)
Personally I'd expose a bunch of planes with kms (enough so that you can reap the usual benefits planes bring wrt video-playback and stuff like that). So perhaps something in line with what current hw does, doubled or so - 16 planes, say. Your driver would reject any requests that need intermediate buffers to store render results, i.e. everything that can't be scanned out directly in real-time at about 60fps. The fun with kms planes is also that right now we have 0 standards for z-ordering and blending, so we'd need to define that first.
I do not yet know where that real-time limit is, but I'm guessing it could be pretty low. If it is, we might start hitting software compositing (like Pixman) very often, which is too slow to be usable.
Well, for other drivers/stacks we'd fall back to GL compositing; pixman would obviously be terribly slow. Curious question: can you provoke the hw/firmware to render into arbitrary buffers, or does it only work together with real display outputs?
So I guess the real question is: what kind of interface does videocore provide? Note that kms framebuffers are super-flexible and you're free to add your own ioctl for special framebuffers which are rendered live by the vc. So that might be a possible way to expose this if you can't tell the vc which buffers to render into explicitly.
We should maybe think about exposing this display engine writeback stuff in some decent way. Maybe a property on the crtc (or plane when doing per-plane writeback) where you attach a target framebuffer for the write. And some virtual connectors/encoders to satisfy the kms API requirements.
With DSI command mode I suppose it would be possible to even mix display and writeback uses of the same hardware pipeline so that the writeback doesn't disturb the display. But I'm not sure there would be any nice way to expose that in kms. Maybe just expose two crtcs, one for writeback and one for display, and multiplex in the driver.
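In userspace terms the writeback-property idea could end up as a single extra property in the same atomic request. The "WRITEBACK_FB_ID" name and its placement on the crtc below are only made up to illustrate the idea; nothing like it exists yet:

#include <stdint.h>
#include <xf86drm.h>
#include <xf86drmMode.h>

/* Hypothetical sketch only: wb_prop_id stands for a driver-exposed
 * "WRITEBACK_FB_ID"-style property on the crtc, discovered by name like
 * any other property. After the commit completes, target_fb_id would hold
 * the composited output instead of (or besides) a real connector. */
static int request_writeback(drmModeAtomicReq *req, uint32_t crtc_id,
                             uint32_t wb_prop_id, uint32_t target_fb_id)
{
    return drmModeAtomicAddProperty(req, crtc_id, wb_prop_id, target_fb_id);
}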
On Mon, Aug 11, 2014 at 07:09:11PM +0300, Ville Syrjälä wrote:
On Mon, Aug 11, 2014 at 05:35:31PM +0200, Daniel Vetter wrote:
On Mon, Aug 11, 2014 at 03:47:22PM +0300, Pekka Paalanen wrote:
What if I cannot even pick a maximum number of planes, but wanted to (as the hardware allows) let the 2D compositing scale up basically unlimited while becoming just slower and slower?
I think at that point one would be looking at a rendering API really, rather than a KMS API, so it's probably out of scope. Where is the line between KMS 2D compositing with planes vs. 2D composite rendering?
I think kms should still be real-time compositing - if you have to internally render to a buffer and then scan that one out due to lack of memory bandwidth or so, that very much sounds like a rendering api. Ofc stuff like writeback buffers blurs that a bit. But hw writeback is still real-time.
Agreed, that's a good and clear definition, even if it might make my life harder.
I'm still not completely sure that using an intermediate buffer means sacrificing real-time performance (i.e. being able to hit the next vblank user space is aiming for); maybe the 2D engine output rate fluctuates so that the scanout block would have problems, but a buffer could still be completed in time. Anyway, details.
Would using an intermediate buffer be ok if we can still maintain real-time? That is, say, if a compositor kicks the atomic update e.g. 7 ms before vblank, would we still hit it even with the intermediate buffer? Whether that is actually possible, I don't know yet.
I guess you could hide this in the kernel if you want. After all the entire point of kms is to shovel the memory management into the kernel driver's responsibility. But I agree with Rob that if there are intermediate buffers, it would be fairly neat to let userspace know about them.
So I don't think the intermediate buffer thing would be a no-go for kms, but I suspect that will only happen when the videocore can't hit the next frame reliably. And that kind of stutter is imo not good for a kms driver. I guess you could forgo vblank timestamp support and just go with super-variable scanout times, but I guess that will make the video playback people unhappy - they already bitch about the sub 1% inaccuracy we have in our hdmi clocks.
Should I really be designing a driver-specific compositing API instead, similar to what the Mesa OpenGL implementations use? Then have user space maybe use the user space driver part via OpenWFC, perhaps? And when I mention OpenWFC, you probably notice that I am not aware of any standard user space API I could be implementing here. ;-)
Personally I'd expose a bunch of planes with kms (enough so that you can reap the usual benefits planes bring wrt video-playback and stuff like that). So perhaps something in line with what current hw does, doubled or so - 16 planes, say. Your driver would reject any requests that need intermediate buffers to store render results, i.e. everything that can't be scanned out directly in real-time at about 60fps. The fun with kms planes is also that right now we have 0 standards for z-ordering and blending, so we'd need to define that first.
I do not yet know where that real-time limit is, but I'm guessing it could be pretty low. If it is, we might start hitting software compositing (like Pixman) very often, which is too slow to be usable.
Well, for other drivers/stacks we'd fall back to GL compositing; pixman would obviously be terribly slow. Curious question: can you provoke the hw/firmware to render into arbitrary buffers, or does it only work together with real display outputs?
So I guess the real question is: what kind of interface does videocore provide? Note that kms framebuffers are super-flexible and you're free to add your own ioctl for special framebuffers which are rendered live by the vc. So that might be a possible way to expose this if you can't tell the vc which buffers to render into explicitly.
We should maybe think about exposing this display engine writeback stuff in some decent way. Maybe a property on the crtc (or plane when doing per-plane writeback) where you attach a target framebuffer for the write. And some virtual connectors/encoders to satisfy the kms API requirements.
With DSI command mode I suppose it would be possible to even mix display and writeback uses of the same hardware pipeline so that the writeback doesn't disturb the display. But I'm not sure there would be any nice way to expose that in kms. Maybe just expose two crtcs, one for writeback and one for display, and multiplex in the driver.
Another idea was to punt this to v4l, at least for the fancier hw which can do a lot of crazy video signal routing ... -Daniel
On Mon, 11 Aug 2014 17:35:31 +0200 Daniel Vetter daniel@ffwll.ch wrote:
Well, for other drivers/stacks we'd fall back to GL compositing; pixman would obviously be terribly slow. Curious question: can you provoke the hw/firmware to render into arbitrary buffers, or does it only work together with real display outputs?
Since we have been talking about on-line (direct to output) and off-line (buffer target) use of the HVS (2D compositing engine), it should be able to do both I think.
So I guess the real question is: what kind of interface does videocore provide? Note that kms framebuffers are super-flexible and you're free to add your own ioctl for special framebuffers which are rendered live by the vc. So that might be a possible way to expose this if you can't tell the vc which buffers to render into explicitly.
Right. I don't know the HVS details yet, but I'm hoping we can tell it to render into a custom buffer, like the 3D core can.
This discussion is very helpful btw, I'm starting to see some possible plans.
Thanks, pq
On Mon, Aug 11, 2014 at 8:06 AM, Daniel Vetter daniel@ffwll.ch wrote:
On Mon, Aug 11, 2014 at 01:38:55PM +0300, Pekka Paalanen wrote:
Hi,
there is some hardware that can do 2D compositing with an arbitrary number of planes. I'm not sure what the absolute maximum number of planes is, but for the discussion, let's say it is 100.
There are many complicated, dynamic constraints on how many planes can be used at once, what sizes they can be, and so on. A driver would be able to check those before kicking the 2D compositing engine.
The 2D compositing engine in the best case (only a few planes used) is able to composite on the fly in scanout, just like the usual overlay hardware blocks in CRTCs. When the composition complexity goes up, the driver can fall back to compositing into a buffer rather than on the fly in scanout. This fallback needs to be completely transparent to user space, implying only additional latency, if anything.
These 2D compositing features should be exposed to user space through a standard kernel ABI, hopefully an existing ABI, or one available in the very near future, like the KMS atomic.
I presume we're talking about the video core from raspi? Or at least something similar?
Assuming the DRM universal planes and atomic mode setting / page flip infrastructure is in place, could the 2D compositing capabilities be exposed through universal planes? We can assume that plane properties are enough to describe all the compositing parameters.
Atomic updates are needed so that the complicated constraints can be checked, and user space can try to reduce the composition complexity if the kernel driver sees that it won't work.
Would it be feasible to generate a hundred identical non-primary planes to be exposed to user space via DRM?
If that could be done, the kernel driver could just use the existing kernel/user ABIs without having to invent something new, and programs like a Wayland compositor would not need to be coded specifically for this hardware.
What problems do you see with this plan? Are any of those problems unfixable or simply prohibitive?
I have some concerns, which I am not sure will actually be a problem:
- Does allocating 100 planes eat too much kernel memory? I mean just the bookkeeping, properties, etc.
- Would such an amount of planes make some in-kernel algorithms slow (particularly in DRM common code)?
- Considering how user space discovers all DRM resources, would this make a compositor "slow" to start?
I don't see any problem with that. We have a few plane-loops, but iirc those can be easily fixed to use indices and similar stuff. The atomic ioctl itself should scale nicely.
I suppose whether these turn out to be prohibitive or not, one just has to implement it and see. It should be usable on a slowish CPU with unimpressive amounts of RAM, because that is where a separate 2D compositing engine gives the most kick.
FWIW, dynamically created/destroyed planes would probably not be the answer. The kernel driver cannot decide before-hand how many planes it can expose. How many planes can be used depends completely on how user space decides to use them. Therefore I believe it should expose the maximum number always, whether there is any real use case that could actually get them all running or not.
Yeah, dynamic planes don't sound like a nice solution, not least because you'll get to audit piles of code. Currently really only framebuffers (and to some extent connectors) can come and go freely in kms-land.
What if I cannot even pick a maximum number of planes, but wanted to (as the hardware allows) let the 2D compositing scale up basically unlimited while becoming just slower and slower?
I think at that point one would be looking at a rendering API really, rather than a KMS API, so it's probably out of scope. Where is the line between KMS 2D compositing with planes vs. 2D composite rendering?
I think kms should still be real-time compositing - if you have to internally render to a buffer and then scan that one out due to lack of memory bandwidth or so, that very much sounds like a rendering api. Ofc stuff like writeback buffers blurs that a bit. But hw writeback is still real-time.
not really sure how much of this is exposed to the cpu side, vs hidden on coproc..
but I tend to think it would be nice for compositors (userspace) to know explicitly what is going on.. ie. if some layers are blended via intermediate buffer, couldn't that intermediate buffer be potentially re-used on next frame if not damaged?
Should I really be designing a driver-specific compositing API instead, similar to what the Mesa OpenGL implementations use? Then have user space maybe use the user space driver part via OpenWFC, perhaps? And when I mention OpenWFC, you probably notice that I am not aware of any standard user space API I could be implementing here. ;-)
Personally I'd expose a bunch of planes with kms (enough so that you can reap the usual benefits planes bring wrt video-playback and stuff like that). So perhaps something in line with what current hw does, doubled or so - 16 planes, say. Your driver would reject any requests that need intermediate buffers to store render results, i.e. everything that can't be scanned out directly in real-time at about 60fps. The fun with kms planes is also that right now we have 0 standards for z-ordering and blending, so we'd need to define that first.
Then expose everything else with a separate api. I guess you'll just end up with per-compositor userspace drivers due to the lack of a widespread 2d api. OpenVG is kinda dead, and cairo might not fit.
I kind of suspect someone should really just design weston2d, an api more explicitly for compositing.. model after OpenWFC if that fits nicely. Or not if it doesn't. Or just use the existing weston front-end/back-end split..
I expect other wayland compositors would want more or less the same thing as weston (barring pre-existing layer-cake mess.. cough, cough, cogl/clutter/gnome-shell..)
We could even make a gallium statetracker implementation of weston2d to get some usage on desktop..
BR, -R
-Daniel
On Mon, Aug 11, 2014 at 09:32:32AM -0400, Rob Clark wrote:
On Mon, Aug 11, 2014 at 8:06 AM, Daniel Vetter daniel@ffwll.ch wrote:
Personally I'd expose a bunch of planes with kms (enough so that you can reap the usual benefits planes bring wrt video-playback and stuff like that). So perhaps something in line with what current hw does, doubled or so - 16 planes, say. Your driver would reject any requests that need intermediate buffers to store render results, i.e. everything that can't be scanned out directly in real-time at about 60fps. The fun with kms planes is also that right now we have 0 standards for z-ordering and blending, so we'd need to define that first.
Then expose everything else with a separate api. I guess you'll just end up with per-compositor userspace drivers due to the lack of a widespread 2d api. OpenVG is kinda dead, and cairo might not fit.
I kind of suspect someone should really just design weston2d, an api more explicitly for compositing.. model after OpenWFC if that fits nicely. Or not if it doesn't. Or just use the existing weston front-end/back-end split..
I expect other wayland compositors would want more or less the same thing as weston (barring pre-existing layer-cake mess.. cough, cough, cogl/clutter/gnome-shell..)
We could even make a gallium statetracker implementation of weston2d to get some usage on desktop..
There's vega already in mesa .... It just looks terribly unused. -Daniel
On Mon, 11 Aug 2014 09:32:32 -0400 Rob Clark robdclark@gmail.com wrote:
On Mon, Aug 11, 2014 at 8:06 AM, Daniel Vetter daniel@ffwll.ch wrote:
On Mon, Aug 11, 2014 at 01:38:55PM +0300, Pekka Paalanen wrote:
What if I cannot even pick a maximum number of planes, but wanted to (as the hardware allows) let the 2D compositing scale up basically unlimited while becoming just slower and slower?
I think at that point one would be looking at a rendering API really, rather than a KMS API, so it's probably out of scope. Where is the line between KMS 2D compositing with planes vs. 2D composite rendering?
I think kms should still be real-time compositing - if you have to internally render to a buffer and then scan that one out due to lack of memory bandwidth or so, that very much sounds like a rendering api. Ofc stuff like writeback buffers blurs that a bit. But hw writeback is still real-time.
not really sure how much of this is exposed to the cpu side, vs hidden on coproc..
but I tend to think it would be nice for compositors (userspace) to know explicitly what is going on.. ie. if some layers are blended via intermediate buffer, couldn't that intermediate buffer be potentially re-used on next frame if not damaged?
Very true, and I think that speaks for exposing the HVS explicitly to user space to be directly used. That way I believe the user space could track damage and composite only the minimum, rather than everything every time which I suppose the KMS API approach would imply.
We don't have dirty regions in KMS API/props, do we? But yeah, that is starting to feel like a stretch to push through KMS.
Should I really be designing a driver-specific compositing API instead, similar to what the Mesa OpenGL implementations use? Then have user space maybe use the user space driver part via OpenWFC, perhaps? And when I mention OpenWFC, you probably notice that I am not aware of any standard user space API I could be implementing here. ;-)
Personally I'd expose a bunch of planes with kms (enough so that you can reap the usual benefits planes bring wrt video-playback and stuff like that). So perhaps something in line with what current hw does, doubled or so - 16 planes, say. Your driver would reject any requests that need intermediate buffers to store render results, i.e. everything that can't be scanned out directly in real-time at about 60fps. The fun with kms planes is also that right now we have 0 standards for z-ordering and blending, so we'd need to define that first.
Then expose everything else with a separate api. I guess you'll just end up with per-compositor userspace drivers due to the lack of a widespread 2d api. OpenVG is kinda dead, and cairo might not fit.
I kind of suspect someone should really just design weston2d, an api more explicitly for compositing.. model after OpenWFC if that fits nicely. Or not if it doesn't. Or just use the existing weston front-end/back-end split..
I expect other wayland compositors would want more or less the same thing as weston (barring pre-existing layer-cake mess.. cough, cough, cogl/clutter/gnome-shell..)
We could even make a gallium statetracker implementation of weston2d to get some usage on desktop..
Yeah. I suppose I should aim for whatever driver-specific interface we need for the HVS to be used from user space, use that in Weston, and get a feeling of what might be a nice, driver-agnostic 2D compositing API.
Thanks, pq
On Tue, Aug 12, 2014 at 9:20 AM, Pekka Paalanen ppaalanen@gmail.com wrote:
but I tend to think it would be nice for compositors (userspace) to know explicitly what is going on.. ie. if some layers are blended via intermediate buffer, couldn't that intermediate buffer be potentially re-used on next frame if not damaged?
Very true, and I think that speaks for exposing the HVS explicitly to user space to be directly used. That way I believe the user space could track damage and composite only the minimum, rather than everything every time which I suppose the KMS API approach would imply.
We don't have dirty regions in KMS API/props, do we? But yeah, that is starting to feel like a stretch to push through KMS.
We have the dirty-ioctl, but imo it's a bit misdesigned: It works at the framebuffer level (so the driver always has to figure out which crtc/plane this is about), and it only works for frontbuffer rendering. It was essentially a single-purpose thing for udl uploads.
But in general I think it would make tons of sense to supply a per-crtc (or maybe per-plane) damage rect with nuclear flips. Both mipi dsi and edp have provisions to upload a subrect, so this could be useful in general. And decent compositors compute this already anyway. -Daniel
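For reference, the existing interface mentioned above works purely at the framebuffer level, which is why the driver has to work out the crtc/plane itself. A minimal use of it with stock libdrm:

#include <stdint.h>
#include <xf86drm.h>
#include <xf86drmMode.h>

/* Flush one damaged rectangle of a framebuffer with the current dirty-fb
 * ioctl; there is no crtc or plane in sight, only the fb. */
static int flush_rect(int fd, uint32_t fb_id,
                      uint16_t x1, uint16_t y1, uint16_t x2, uint16_t y2)
{
    drmModeClip clip = { .x1 = x1, .y1 = y1, .x2 = x2, .y2 = y2 };

    return drmModeDirtyFB(fd, fb_id, &clip, 1);
}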
On Tue, Aug 12, 2014 at 10:03:26AM +0200, Daniel Vetter wrote:
On Tue, Aug 12, 2014 at 9:20 AM, Pekka Paalanen ppaalanen@gmail.com wrote:
but I tend to think it would be nice for compositors (userspace) to know explicitly what is going on.. ie. if some layers are blended via intermediate buffer, couldn't that intermediate buffer be potentially re-used on next frame if not damaged?
Very true, and I think that speaks for exposing the HVS explicitly to user space to be directly used. That way I believe the user space could track damage and composite only the minimum, rather than everything every time which I suppose the KMS API approach would imply.
We don't have dirty regions in KMS API/props, do we? But yeah, that is starting to feel like a stretch to push through KMS.
We have the dirty-ioctl, but imo it's a bit misdesigned: It works at the framebuffer level (so the driver always has to figure out which crtc/plane this is about), and it only works for frontbuffer rendering. It was essentially a single-purpose thing for udl uploads.
But in general I think it would make tons of sense to supply a per-crtc (or maybe per-plane) damage rect with nuclear flips. Both mipi dsi and edp have provisions to upload a subrect, so this could be useful in general. And decent compositors compute this already anyway.
Agreed, as long as we make it more of a hint so that the driver is allowed to expand the rect to satisfy hardware specific alignment requirements and whatnot.
I think a single per-crtc rect should be enough, but in case people would like to implement a more sophisticated multi-rect update I suppose we could allow it. And for those that don't want the extra complexity of trying to deal with multiple rectangles, the driver could just calculate the bounding rectangle and update that.
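That bounding-rectangle fallback is cheap enough to sketch directly; assuming the damage rects come in with the same drmModeClip layout the dirty ioctl uses:

#include <xf86drmMode.h>

/* Collapse a multi-rect damage list (n >= 1) into its bounding rectangle,
 * for drivers that only want to handle a single update region. */
static drmModeClip bounding_rect(const drmModeClip *clips, unsigned int n)
{
    drmModeClip b = clips[0];

    for (unsigned int i = 1; i < n; i++) {
        if (clips[i].x1 < b.x1) b.x1 = clips[i].x1;
        if (clips[i].y1 < b.y1) b.y1 = clips[i].y1;
        if (clips[i].x2 > b.x2) b.x2 = clips[i].x2;
        if (clips[i].y2 > b.y2) b.y2 = clips[i].y2;
    }
    return b;
}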
Daniel Vetter daniel@ffwll.ch writes:
On Mon, Aug 11, 2014 at 01:38:55PM +0300, Pekka Paalanen wrote:
Hi,
there is some hardware that can do 2D compositing with an arbitrary number of planes. I'm not sure what the absolute maximum number of planes is, but for the discussion, let's say it is 100.
There are many complicated, dynamic constraints on how many planes can be used at once, what sizes they can be, and so on. A driver would be able to check those before kicking the 2D compositing engine.
The 2D compositing engine in the best case (only a few planes used) is able to composite on the fly in scanout, just like the usual overlay hardware blocks in CRTCs. When the composition complexity goes up, the driver can fall back to compositing into a buffer rather than on the fly in scanout. This fallback needs to be completely transparent to user space, implying only additional latency, if anything.
These 2D compositing features should be exposed to user space through a standard kernel ABI, hopefully an existing ABI, or one available in the very near future, like the KMS atomic.
I presume we're talking about the video core from raspi? Or at least something similar?
Pekka wasn't sure if things were confidential here, but I can say it: Yeah, it's the RPi.
While I haven't written code using the compositor interface (I just did enough to shim in a single plane for bringup, and I'm hoping Pekka and company can handle the rest for me :) ), my understanding is that the way you make use of it is that you've got your previous frame loaded up in the HVS (the plane compositor hardware), then when you're asked to put up a new frame that's going to be too hard, you take some complicated chunk of your scene and ask the HVS to use any spare bandwidth it has while it's still scanning out the previous frame in order to composite that piece of new scene into memory. Then, when it's done with the offline composite, you ask the HVS to do the next scanout frame using the original scene with the pre-composited temporary buffer.
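That flow would look roughly like the following sketch; every name in it is a made-up placeholder, since the HVS interface itself is not public:

struct scene;
struct buffer;

/* All placeholders for whatever the real HVS/driver interface ends up being. */
extern struct scene *scene_hard_subset(struct scene *s);
extern struct buffer *hvs_composite_offline(struct scene *subset);
extern void scene_replace_with_buffer(struct scene *s, struct buffer *pre);
extern void hvs_program_scanout(struct scene *s);

static void flip_too_complex_frame(struct scene *next)
{
    /* 1. While the previous frame keeps scanning out, spend spare HVS
     *    bandwidth flattening the part of the new scene that won't fit live. */
    struct buffer *tmp = hvs_composite_offline(scene_hard_subset(next));

    /* 2. Swap the hard part of the scene for the pre-composited buffer. */
    scene_replace_with_buffer(next, tmp);

    /* 3. Program the next scanout frame as usual; it now fits in real time. */
    hvs_program_scanout(next);
}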
I'm pretty comfortable with the idea of having some large number of planes preallocated, and deciding that "nobody could possibly need more than 16" (or whatever).
My initial reaction to "we should just punt when we run out of bandwidth and have a special driver interface for offline composite" was "that's awful, when the kernel could just get the job done immediately, and easily, and it would know exactly what it needed to composite to get things to fit (unlike userspace)". I'm trying to come up with what benefit there would be to having a separate interface for offline composite. I've got 3 things:
- Avoids having a potentially long, interruptible wait in the modeset path while the offline composite happens. But I think we have other interruptible waits in that path already.
- Userspace could potentially do something else besides use the HVS to get the fallback done. Video would have to use the HVS, to get the same scaling filters applied as the previous frame where things *did* fit, but I guess you could composite some 1:1 RGBA overlays in GL, which would have more BW available to it than what you're borrowing from the previous frame's HVS capacity.
- Userspace could potentially use the offline composite interface for things besides just the running-out-of-bandwidth case. Like, it was doing a nicely-filtered downscale of an overlaid video, then the user hit pause and walked away: you could have a timeout that noticed that the complicated scene hadn't changed in a while, and you'd drop from overlays to a HVS-composited single plane to reduce power.
The third one is the one I've actually found kind of compelling, and might be switching me from wanting no userspace visibility into the fallback. But I don't have a good feel for how much complexity there is to our descriptions of planes, and how much poorly-tested interface we'd be adding to support this usecase.
(Because, honestly, I don't expect the fallbacks to be hit much -- my understanding of the bandwidth equation is that you're mostly counting the number of pixels that have to be read, and clipped-out pixels because somebody's overlaid on top of you don't count unless they're in the same burst read. So unless people are going nuts with blending in overlays, or downscaled video, it's probably not a problem, and something that gets your pixels on the screen at all is sufficient)
On Mon, Aug 11, 2014 at 10:16:24AM -0700, Eric Anholt wrote:
Daniel Vetter daniel@ffwll.ch writes:
On Mon, Aug 11, 2014 at 01:38:55PM +0300, Pekka Paalanen wrote:
Hi,
there is some hardware that can do 2D compositing with an arbitrary number of planes. I'm not sure what the absolute maximum number of planes is, but for the discussion, let's say it is 100.
There are many complicated, dynamic constraints on how many planes can be used at once, what sizes they can be, and so on. A driver would be able to check those before kicking the 2D compositing engine.
The 2D compositing engine in the best case (only a few planes used) is able to composite on the fly in scanout, just like the usual overlay hardware blocks in CRTCs. When the composition complexity goes up, the driver can fall back to compositing into a buffer rather than on the fly in scanout. This fallback needs to be completely transparent to user space, implying only additional latency, if anything.
These 2D compositing features should be exposed to user space through a standard kernel ABI, hopefully an existing ABI, or one available in the very near future, like the KMS atomic.
I presume we're talking about the video core from raspi? Or at least something similar?
Pekka wasn't sure if things were confidential here, but I can say it: Yeah, it's the RPi.
While I haven't written code using the compositor interface (I just did enough to shim in a single plane for bringup, and I'm hoping Pekka and company can handle the rest for me :) ), my understanding is that the way you make use of it is that you've got your previous frame loaded up in the HVS (the plane compositor hardware), then when you're asked to put up a new frame that's going to be too hard, you take some complicated chunk of your scene and ask the HVS to use any spare bandwidth it has while it's still scanning out the previous frame in order to composite that piece of new scene into memory. Then, when it's done with the offline composite, you ask the HVS to do the next scanout frame using the original scene with the pre-composited temporary buffer.
I'm pretty comfortable with the idea of having some large number of planes preallocated, and deciding that "nobody could possibly need more than 16" (or whatever).
My initial reaction to "we should just punt when we run out of bandwidth and have a special driver interface for offline composite" was "that's awful, when the kernel could just get the job done immediately, and easily, and it would know exactly what it needed to composite to get things to fit (unlike userspace)". I'm trying to come up with what benefit there would be to having a separate interface for offline composite. I've got 3 things:
Avoids having a potentially long, interruptible wait in the modeset path while the offline composite happens. But I think we have other interruptible waits in that path alreaady.
Userspace could potentially do something else besides use the HVS to get the fallback done. Video would have to use the HVS, to get the same scaling filters applied as the previous frame where things *did* fit, but I guess you could composite some 1:1 RGBA overlays in GL, which would have more BW available to it than what you're borrowing from the previous frame's HVS capacity.
Userspace could potentially use the offline composite interface for things besides just the running-out-of-bandwidth case. Like, it was doing a nicely-filtered downscale of an overlaid video, then the user hit pause and walked away: you could have a timeout that noticed that the complicated scene hadn't changed in a while, and you'd drop from overlays to a HVS-composited single plane to reduce power.
The compositor should already do a rough bw guesstimate and, if the scene stops changing, bake the entire scene into a single framebuffer. The exact same issue happens on more usual hw with video overlays, too.
Ofc if it turns out that scanning out your yuv planes takes less bw, then the overlay shouldn't be stopped. But imo there's nothing special here for the rpi.
(Because, honestly, I don't expect the fallbacks to be hit much -- my understanding of the bandwidth equation is that you're mostly counting the number of pixels that have to be read, and clipped-out pixels because somebody's overlaid on top of you don't count unless they're in the same burst read. So unless people are going nuts with blending in overlays, or downscaled video, it's probably not a problem, and something that gets your pixels on the screen at all is sufficient)
Yeah I guess we need to check reality here. If the "we've run out of bw" case just never happens, then it's pointless to write special code for it. And we can always add a limit later for the case where GL is usually better, and tell userspace that we can't do this many planes. The exact same running-out-of-memory-bw thing can happen anywhere else, too. -Daniel
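Assuming the atomic ioctl and its TEST_ONLY flag land roughly in the proposed form, the "driver tells userspace it can't do this many planes" flow on the compositor side could be sketched like this; fill_plane_props() is a hypothetical helper that adds the framebuffer/CRTC properties for the first n planes of the scene to the request:

#include <xf86drm.h>
#include <xf86drmMode.h>

void fill_plane_props(drmModeAtomicReqPtr req, int n);  /* hypothetical */

int commit_reducing_planes(int fd, int want)
{
    for (int n = want; n > 0; n--) {
        drmModeAtomicReqPtr req = drmModeAtomicAlloc();
        int ret;

        fill_plane_props(req, n);

        /* Let the driver check its bandwidth/size constraints without
         * touching the hardware. */
        ret = drmModeAtomicCommit(fd, req, DRM_MODE_ATOMIC_TEST_ONLY, NULL);
        if (ret == 0)
            ret = drmModeAtomicCommit(fd, req, 0, NULL);

        drmModeAtomicFree(req);
        if (ret == 0)
            return 0;

        /* Too complex: drop the least important plane and let that
         * surface go through the GL/pixman composited path instead. */
    }
    return -1;
}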
On Mon, 11 Aug 2014 19:27:45 +0200 Daniel Vetter daniel@ffwll.ch wrote:
Yeah I guess we need to check reality here. If the "we've run out of bw" case just never happens then it's pointless to write special code for it. And we can always add a limit later for the case where GL is usually better and tell userspace that we can't do this many planes. Exact same thing with running out of memory bw can happen anywhere else, too.
I had a chat with Eric last night, and our different views about the on-line/real-time performance limits of the HVS seem to be due to alpha blending.
Eric has not been using alpha blending much or at all, while my experiments with Weston and DispmanX pretty much always need alpha blending (e.g. because DispmanX cannot say that only a sub-region of a buffer needs blending). Eric says alpha blending kills the performance.
This makes me think that maybe I should expose only one or two (cursor?) planes with alpha blending formats, and all other planes with only opaque formats. That would naturally limit the compositor's use of planes to the cases where it probably matters most: cursors, and opaque video and OpenGL surfaces.
Then all the alpha-blended stuff will hit the fallback... which is... Pixman at the moment (thinking about Weston here) until Eric gets the GLESv2 flying. :-/
That means that doing a driver-specific kernel/user ABI for using the HVS seems required. Write a driver-specific libdrm API for it, use it directly in Weston, and see what falls out later.
Thanks, pq
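If a driver exposed the split that way, the compositor could sort the planes using nothing more than the existing libdrm plane queries; a minimal sketch, where only the classification policy is new:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <xf86drm.h>
#include <xf86drmMode.h>
#include <drm_fourcc.h>

static bool plane_supports_alpha(int fd, uint32_t plane_id)
{
    drmModePlanePtr p = drmModeGetPlane(fd, plane_id);
    bool alpha = false;

    for (uint32_t i = 0; p && i < p->count_formats; i++)
        if (p->formats[i] == DRM_FORMAT_ARGB8888)
            alpha = true;

    drmModeFreePlane(p);
    return alpha;
}

static void classify_planes(int fd)
{
    drmSetClientCap(fd, DRM_CLIENT_CAP_UNIVERSAL_PLANES, 1);

    drmModePlaneResPtr res = drmModeGetPlaneResources(fd);
    for (uint32_t i = 0; res && i < res->count_planes; i++) {
        uint32_t id = res->planes[i];

        /* e.g. decorated windows only on the alpha planes, opaque
         * video/GL surfaces and cursors on whatever is left */
        printf("plane %u: %s\n", id,
               plane_supports_alpha(fd, id) ? "alpha" : "opaque-only");
    }
    drmModeFreePlaneResources(res);
}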
Pekka Paalanen ppaalanen@gmail.com writes:
Eric has not been using alpha blending much or at all, while my experiments with Weston and DispmanX pretty much always need alpha blending (e.g. because DispmanX cannot say that only a sub-region of a buffer needs blending). Eric says alpha blending kills the performance.
Note, I wasn't saying anything about performance. I was just talking about how compositing in X knows that (almost) everything is actually opaque, so I don't have the worries about alpha blending that you apparently do in Weston.
On Tue, 12 Aug 2014 09:10:47 -0700 Eric Anholt eric@anholt.net wrote:
Note, I wasn't saying anything about performance. I was just talking about how compositing in X knows that (almost) everything is actually opaque, so I don't have the worries about alpha blending that you apparently do in Weston.
Ok, I'm confused.
Most surfaces in Weston do have non-opaque parts, usually the window decorations, depending of course on the desktop visual style in use. That means almost no surface is completely opaque, the wallpaper being the obvious exception.
In Weston we also have the opaque region, set by apps as a hint that those regions do not need alpha blending. However, with DispmanX there was no way to make use of the opaque region markup unless it covered the whole surface.
Well, I could have split every window into 5 DispmanX elements instead of just one (4 blended, 1 opaque) to approximate the usual case with decorations, but I never tried that. There was some concern that the number of elements would become the dominating limit on how much can be on screen at once, so it didn't feel worth the added complexity, and enabling the automatic fallback to off-line compositing just worked.
Alpha blending can still be forced onto a whole window by desktop effects, though.
Does this explain why I saw that, with DispmanX, the HVS on-line mode would fail to reliably drive the output with just one or two basic app windows open, if even that much? IIRC that was on a 1280x1024 monitor, not even close to full HD.
Thanks, pq
On Mon, Aug 11, 2014 at 01:38:55PM +0300, Pekka Paalanen wrote:
The 2D compositing engine in the best case (only few planes used) is able to composite on the fly in scanout, just like the usual overlay hardware blocks in CRTCs. When the composition complexity goes up, the driver can fall back to compositing into a buffer rather than on the fly in scanout. This fallback needs to be completely transparent to the user space, implying only additional latency if anything.
Is your requirement that this needs to be transparent to all userspace or just transparent to your display server (e.g., Weston)? I'm wondering whether it might be easier to write a libdrm interposer that intercepts any libdrm calls dealing with planes and exposes a bunch of additional "virtual" planes to the display server when queried. When you submit an atomic ioctl, your interposer would figure out the best strategy to make that happen given the real hardware available on your system and would try to blend some of your excess buffers via whatever userspace APIs are available (Cairo, GLES, OpenVG, etc.). This would keep kernel complexity down and allow easier debugging and tuning.
Matt
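For illustration, the plane-enumeration half of such an interposer could be as small as the sketch below. The VIRT_* values and the whole approach are hypothetical, and a real interposer would also have to intercept the atomic commit and pre-composite the surfaces on the virtual planes in userspace:

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <xf86drm.h>
#include <xf86drmMode.h>

#define VIRT_PLANES     32
#define VIRT_PLANE_BASE 0x40000000u   /* made-up ID range for the fake planes */

drmModePlaneResPtr drmModeGetPlaneResources(int fd)
{
    drmModePlaneResPtr (*real)(int) = (drmModePlaneResPtr (*)(int))
        dlsym(RTLD_NEXT, "drmModeGetPlaneResources");
    drmModePlaneResPtr res = real(fd);
    if (!res)
        return NULL;

    uint32_t n = res->count_planes;
    uint32_t *padded = calloc(n + VIRT_PLANES, sizeof(*padded));
    memcpy(padded, res->planes, n * sizeof(*padded));
    for (uint32_t i = 0; i < VIRT_PLANES; i++)
        padded[n + i] = VIRT_PLANE_BASE + i;

    /* Leaks the original array, which is fine for a sketch. The real
     * plane IDs stay valid; the virtual ones never reach the kernel. */
    res->planes = padded;
    res->count_planes = n + VIRT_PLANES;
    return res;
}

Built as a shared object and injected with something like LD_PRELOAD=./planes-interposer.so, the display server would see the padded plane list without any kernel changes.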
On Mon, 11 Aug 2014 07:37:18 -0700 Matt Roper matthew.d.roper@intel.com wrote:
That's an inventive proposition. ;-)
I would still need to design the kernel/user ABI for the HVS (the 2D engine). As I am starting to believe that the "non-real-time" use of the HVS does not belong behind the KMS API, we might as well do things more properly and expose it with a real user space API eventually.
Thanks, pq