Hi folks,
I'm currently thinking about adding a hw-accelerated bitblt operation. The idea goes like this:
* we add some bitblt ioctl which copies rects between BOs (it also handles memory layouts, pixfmt conversion, etc.)
* the driver can decide to let the GPU or IPU do that, if available
* if we have a suitable DMA engine (maybe only the more complex ones which can handle lines on their own ...) we'll use that
* as a fallback, resort to memcpy()
Whether a DMA engine can/should be used might be highly hw-specific, so that would probably be configured in DT.
To use that feature, userland could actually allocate two BOs: one mapped as a framebuffer to some crtc, the other just a memory buffer. It could then render to the fast memory buffer and tell DRM to copy only the changed regions over to the graphics memory via DMA (or whatever is best on that particular hw platform).
What do you think about that idea?
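To make the memcpy() fallback concrete: for a single rect it boils down to a per-line copy. A minimal sketch (the struct and function names here are purely hypothetical, nothing like this exists as uapi):

```c
#include <stdint.h>
#include <string.h>

/* hypothetical per-rect argument of such a bitblt ioctl */
struct blit_rect {
	uint32_t src_x, src_y;   /* source origin, in pixels */
	uint32_t dst_x, dst_y;   /* destination origin, in pixels */
	uint32_t width, height;  /* rect size, in pixels */
};

/*
 * memcpy() fallback: copy one rect between two linear buffers,
 * line by line. Pitches are in bytes, cpp is bytes per pixel.
 * No pixfmt conversion here - that would be the IPU/GPU path.
 */
static void blit_rect_memcpy(uint8_t *dst, uint32_t dst_pitch,
                             const uint8_t *src, uint32_t src_pitch,
                             const struct blit_rect *r, uint32_t cpp)
{
	for (uint32_t y = 0; y < r->height; y++)
		memcpy(dst + (uint64_t)(r->dst_y + y) * dst_pitch + r->dst_x * cpp,
		       src + (uint64_t)(r->src_y + y) * src_pitch + r->src_x * cpp,
		       (size_t)r->width * cpp);
}
```

A driver would only take this path when neither a suitable DMA engine nor a GPU/IPU is available for the copy.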
--mtx
On Tue, Aug 02, 2016 at 03:21:08PM +0200, Enrico Weigelt, metux IT consult wrote:
Hi folks,
I'm currently thinking about adding a hw-accelerated bitblt operation. The idea goes like this:
- we add some bitblt ioctl which copies rects between BOs (it also handles memory layouts, pixfmt conversion, etc.)
- the driver can decide to let the GPU or IPU do that, if available
- if we have a suitable DMA engine (maybe only the more complex ones which can handle lines on their own ...) we'll use that
- as a fallback, resort to memcpy()
Whether a DMA engine can/should be used might be highly hw-specific, so that would probably be configured in DT.
To use that feature, userland could actually allocate two BOs: one mapped as a framebuffer to some crtc, the other just a memory buffer. It could then render to the fast memory buffer and tell DRM to copy only the changed regions over to the graphics memory via DMA (or whatever is best on that particular hw platform).
What do you think about that idea?
If you mean "add a generic hw-accelerated bitblt operation": This is not how drm works. The generic kms stuff is about display only, with just very basic (hence "dumb") buffer allocation support in a generic way.
If you mean "expose the dma engine I have here to userspace in driver-private ioctls with the trade-off logic between that, kms compositing using the display block and memcpy in userspace", then go ahead ;-) But if you do that, pls don't forget that for any uapi the drm subsystem requires corresponding open source userspace (in a real app/compositor, not just some toy test or something similar).
Cheers, Daniel
On 02.08.2016 16:04, Daniel Vetter wrote:
If you mean "add a generic hw-accelerated bitblt operation": This is not how drm works. The generic kms stuff is about display only, with just very basic (hence "dumb") buffer allocation support in a generic way.
Well, if it already does buffer allocation and mapping (which might also involve copying around physical buffers), why not also add copy-between-buffers?
If you mean "expose the dma engine I have here to userspace in driver-private ioctls with the trade-off logic between that, kms compositing using the display block and memcpy in userspace", then go ahead ;-) But if you do that, pls don't forget that for any uapi the drm subsystem requires corresponding open source userspace (in a real app/compositor, not just some toy test or something similar).
I don't intend to add yet another specific driver and driver-specific ioctl()s, but instead a generic interface. Such stuff needs kernel support and kernel configuration anyway, so I'd like to keep it out of userland's business.
--mtx
On Tue, Aug 2, 2016 at 5:43 PM, Enrico Weigelt, metux IT consult enrico.weigelt@gr13.net wrote:
On 02.08.2016 16:04, Daniel Vetter wrote:
If you mean "add a generic hw-accelerated bitblt operation": This is not how drm works. The generic kms stuff is about display only, with just very basic (hence "dumb") buffer allocation support in a generic way.
Well, if it already does buffer allocation and mapping (which might also involve copying around physical buffers), why not also add copy-between-buffers?
except "dumb" buffers exist *only* for CPU rendered content, you cannot assume that a gpu can accelerate anything with them.
They basically exist just for simple splash screens and fbcon
If you mean "expose the dma engine I have here to userspace in driver-private ioctls with the trade-off logic between that, kms compositing using the display block and memcpy in userspace", then go ahead ;-) But if you do that, pls don't forget that for any uapi the drm subsystem requires corresponding open source userspace (in a real app/compositor, not just some toy test or something similar).
I don't intend to add yet another specific driver and driver-specific ioctl()s, but instead a generic interface. Such stuff needs kernel support and kernel configuration anyway, so I'd like to keep it out of userland's business.
there is a reason that there is no generic gpu cmd submission ioctl. It is too much hw specific, and anyway it is only used by device specific userspace (ie. gl driver and/or xorg ddx)
BR, -R
--mtx
dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
On 03.08.2016 01:12, Rob Clark wrote:
Hi,
Well, if it already does buffer allocation and mapping (which might also involve copying around physical buffers), why not also add copy-between-buffers?
except "dumb" buffers exist *only* for CPU rendered content, you cannot assume that a gpu can accelerate anything with them.
Exactly my usecase: having no (usable) GPU at all, but an SDMA controller - or even better: an IPU - which can do the bitblt (maybe even w/ colorspace conversion, rotation, etc.).
There might be GPUs which can also do that - and in that case it should be done by the GPU.
They basically exist just for simple splash screens and fbcon
Or when you don't have a (usable) GPU at all?
there is a reason that there is no generic gpu cmd submission ioctl. It is too much hw specific,
Sure, but I'm not going to use a GPU at all, but different hw.
and anyway it is only used by device specific userspace (ie. gl driver and/or xorg ddx)
Actually, on my targets I have neither gl nor xorg, and I'd like to keep userland generic. I'd hate to have lots of hw-specific cairo backends when I'll have to touch the kernel anyway, in order to use sdma or ipu.
By the way: while hacking a bit on mesa (backporting to Trusty), I came across separate hw-specific calls for retrieving the video memory size. Seems to be a really common thing ... is there any hw that does not have such a thing? Couldn't that be a generic ioctl()?
I somewhat got the strange feeling that anything that goes beyond a very trivial dumb framebuffer has hw-specific ioctl's ;-o
--mtx
On 3 August 2016 at 13:33, Enrico Weigelt, metux IT consult enrico.weigelt@gr13.net wrote:
On 03.08.2016 01:12, Rob Clark wrote:
Hi,
Well, if it already does buffer allocation and mapping (which might also involve copying around physical buffers), why not also add copy-between-buffers?
except "dumb" buffers exist *only* for CPU rendered content, you cannot assume that a gpu can accelerate anything with them.
Exactly my usecase: having no (usable) GPU at all, but an SDMA controller - or even better: an IPU - which can do the bitblt (maybe even w/ colorspace conversion, rotation, etc.).
There might be GPUs which can also do that - and in that case it should be done by the GPU.
They basically exist just for simple splash screens and fbcon
Or when you don't have a (usable) GPU at all?
there is a reason that there is no generic gpu cmd submission ioctl. It is too much hw specific,
Sure, but I'm not going to use a GPU at all, but different hw.
and anyway it is only used by device specific userspace (ie. gl driver and/or xorg ddx)
Actually, on my targets I have neither gl nor xorg, and I'd like to keep userland generic. I'd hate to have lots of hw-specific cairo backends when I'll have to touch the kernel anyway, in order to use sdma or ipu.
By the way: while hacking a bit on mesa (backporting to Trusty), I came across separate hw-specific calls for retrieving the video memory size. Seems to be a really common thing ... is there any hw that does not have such a thing? Couldn't that be a generic ioctl()?
I somewhat got the strange feeling that anything that goes beyond a very trivial dumb framebuffer has hw-specific ioctl's ;-o
The thing is, stuff looks generic until you go to use it; just abstract it in userspace.
Because no hw is the same once you go beyond that.
Video memory size means what? VRAM, GPU accessible system RAM, amount of CPU visible VRAM?
Dave.
On 03.08.2016 05:47, Dave Airlie wrote:
Because no hw is the same once you go beyond that.
hmm, it doesn't seem to be so extremely different that we can't at least abstract some common aspects.
Video memory size means what? VRAM, GPU accessible system RAM, amount of CPU visible VRAM?
Actually, these are separate things, which of course should be reported in separate fields:
* phys_aperture_size:
  --> physical maximum for the shared ram between cpu and gpu (cpu-accessible gpu memory)
* avail_aperture_size:
  --> the logical maximum that the process can map
  --> might be lower than phys_..., eg. due to process limits or when running a 32bit userland on a 64bit kernel
* phys_gpu_memory_size:
  --> the total size of the gpu's memory (that could be accessed by the cpu)
  --> might be larger than phys_aperture_size / avail_aperture_size when the gpu just has more memory than can be shared w/ the cpu
  --> eg. an interesting indicator of how much can be filled w/ readonly textures (which don't need to be cpu-accessible anymore)
* avail_gpu_memory_size:
  --> the logical maximum that the process can consume
* phys_shm_size:
  --> max size of shared system memory (directly accessible by both gpu and cpu)
  --> commonly available on SoCs - on other hw might be zero
  --> not counting on-board RAM that is hw-mapped to the GPU, thus not falling into system memory in the first place
IMHO, that should catch all usual scenarios, from the fat gamer-GPU boards to tiny SoCs ... did I miss something here?
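Expressed as a (again purely hypothetical) struct that such a query ioctl could fill in - the field names are the ones from the list above, plus a trivial sanity check on the relations between them:

```c
#include <stdint.h>

/* hypothetical read-only query result, filled in by the kernel */
struct drm_mem_info {
	uint64_t phys_aperture_size;    /* hw limit of cpu-accessible gpu memory */
	uint64_t avail_aperture_size;   /* what this process can actually map */
	uint64_t phys_gpu_memory_size;  /* total gpu memory */
	uint64_t avail_gpu_memory_size; /* what this process can still consume */
	uint64_t phys_shm_size;         /* system memory shared w/ the gpu (SoCs), else 0 */
};

/* the "avail" values can never exceed their "phys" counterparts */
static int drm_mem_info_sane(const struct drm_mem_info *mi)
{
	return mi->avail_aperture_size <= mi->phys_aperture_size &&
	       mi->avail_gpu_memory_size <= mi->phys_gpu_memory_size;
}
```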
In the end, these values only seem to be used as some statistics for the userland's decision on how much stuff it uploads to the GPU.
By the way: what about resource limits? Can we control how much GPU memory an unprivileged process can consume, in order to prevent DOS'ing other processes (even other users)?
--mtx
Hi Enrico,
On 2016-08-02 15:21, Enrico Weigelt, metux IT consult wrote:
I'm currently thinking about adding a hw-accelerated bitblt operation. The idea goes like this:
- we add some bitblt ioctl which copies rects between BOs (it also handles memory layouts, pixfmt conversion, etc.)
- the driver can decide to let the GPU or IPU do that, if available
- if we have a suitable DMA engine (maybe only the more complex ones which can handle lines on their own ...) we'll use that
- as a fallback, resort to memcpy()
Whether a DMA engine can/should be used might be highly hw-specific, so that would probably be configured in DT.
To use that feature, userland could actually allocate two BOs: one mapped as a framebuffer to some crtc, the other just a memory buffer. It could then render to the fast memory buffer and tell DRM to copy only the changed regions over to the graphics memory via DMA (or whatever is best on that particular hw platform).
What do you think about that idea?
I'm working now on something similar, but more generic. There is already a framework for picture processing (converting, scaling, blitting, rotating) in Exynos DRM. It is called IPP (Image Post Processing), but its user interface is really ugly and limited, so I plan to rewrite it and make it really generic. Some discussion on it has already happened in the following thread: http://thread.gmane.org/gmane.linux.kernel.samsung-soc/49743
I plan to propose an API based on DRM object/properties, which will be similar to KMS atomic API. I will let you know when I have it ready for presenting in public.
Best regards
On Wed, Aug 03, 2016 at 11:24:37AM +0200, Marek Szyprowski wrote:
Hi Enrico,
On 2016-08-02 15:21, Enrico Weigelt, metux IT consult wrote:
I'm currently thinking about adding a hw-accelerated bitblt operation. The idea goes like this:
- we add some bitblt ioctl which copies rects between BOs (it also handles memory layouts, pixfmt conversion, etc.)
- the driver can decide to let the GPU or IPU do that, if available
- if we have a suitable DMA engine (maybe only the more complex ones which can handle lines on their own ...) we'll use that
- as a fallback, resort to memcpy()
Whether a DMA engine can/should be used might be highly hw-specific, so that would probably be configured in DT.
To use that feature, userland could actually allocate two BOs: one mapped as a framebuffer to some crtc, the other just a memory buffer. It could then render to the fast memory buffer and tell DRM to copy only the changed regions over to the graphics memory via DMA (or whatever is best on that particular hw platform).
What do you think about that idea?
I'm working now on something similar, but more generic. There is already a framework for picture processing (converting, scaling, blitting, rotating) in Exynos DRM. It is called IPP (Image Post Processing), but its user interface is really ugly and limited, so I plan to rewrite it and make it really generic. Some discussion on it has already happened in the following thread: http://thread.gmane.org/gmane.linux.kernel.samsung-soc/49743
I plan to propose an API based on DRM object/properties, which will be similar to KMS atomic API. I will let you know when I have it ready for presenting in public.
In case it's not clear from Dave's, Rob's and my reply: Generic rendering of any kind is _very_ unpopular in the drm subsystem. We tried semi-generic 15 years ago (with some of the shared drm core stuff between linux and bsd) and it's a disaster of fake-generic, single-use code.
The reason for that is that hw accel is actually not simple. You essentially need as little additional abstraction as possible between what's your real client api (hw composer, Xrender or whatever it is) and the hw, because for optimal performance you _must_ supply the commands to the kernel in a format/layout as close as possible to what the hardware uses. That means no shared command submission of any kind. And the other reason is that cache transfers and memory transfers are highly hardware specific, too. Which means no shared buffer management and mapping interfaces either.
In short, if you want to get this in you need to disprove the last 15-20 years of linux gfx driver development and show that we've been wrong on these. Expect _very_ high resistance to anything remotely looking like a shared/common blitter uapi. Of course having some common helper code to make drivers easier to type (like cma helpers, or ttm, or similar) is something entirely different, this is about the uapi.
And please don't be discouraged here, I just want to set clear expectations to avoid disappointment. Supporting blitter hardware is obviously a good idea, and I think the drm subsystem is the right place for that (especially if you have a display block or sometimes a real gpu connected to that blitter).
Cheers, Daniel
On 03.08.2016 13:47, Daniel Vetter wrote:
Because for optimal performance you _must_ supply the commands to the kernel in a format/layout as close as possible to what the hardware uses. That means no shared command submission of any kind. And the other reason is that cache transfers and memory transfers are highly hardware specific, too. Which means no shared buffer management and mapping interfaces either.
Right, but I wonder whether that applies to my case. Again, I'm talking about using aux IPs (not the actual GPU) for things like copying image regions, maybe even pixfmt/colorspace conversions - those things, in the embedded world, usually aren't done by the gpu, but by separate IPs.
Of course having some common helper code to make drivers easier to type (like cma helpers, or ttm, or similar) is something entirely different, this is about the uapi.
Well, I'm actually talking about an uapi, as userland somehow needs to call it :p
Doing it in specific drivers doesn't seem to be a good way, as sooner or later we'd have to implement that in lots of different drivers (plus corresponding userland support), as it's pretty orthogonal to the GPU, as well as fbs/crtcs. Just in some cases, it **might** also be done via the GPU, if applicable (maybe only when it's idle anyway), but that's not the usual case. Instead the usual case would be employing some DMA controller or IPU.
And please don't be discouraged here, I just want to set clear expectations to avoid disappointment. Supporting blitter hardware is obviously a good idea, and I think the drm subsystem is the right place for that (especially if you have a display block or sometimes a real gpu connected to that blitter).
Okay, where else should we put it? Invent an entirely new device for that?
--mtx
On Thu, Aug 04, 2016 at 01:32:57AM +0200, Enrico Weigelt, metux IT consult wrote:
On 03.08.2016 13:47, Daniel Vetter wrote:
Because for optimal performance you _must_ supply the commands to the kernel in a format/layout as close as possible to what the hardware uses. That means no shared command submission of any kind. And the other reason is that cache transfers and memory transfers are highly hardware specific, too. Which means no shared buffer management and mapping interfaces either.
Right, but I wonder whether that applies to my case. Again, I'm talking about using aux IPs (not the actual GPU) for things like copying image regions, maybe even pixfmt/colorspace conversions - those things, in the embedded world, usually aren't done by the gpu, but by separate IPs.
15+ years ago gpus weren't much more than fancy blitters either ;-)
Of course having some common helper code to make drivers easier to type (like cma helpers, or ttm, or similar) is something entirely different, this is about the uapi.
Well, I'm actually talking about an uapi, as userland somehow needs to call it :p
Doing it in specific drivers doesn't seem to be a good way, as sooner or later we'd have to implement that in lots of different drivers (plus corresponding userland support), as it's pretty orthogonal to the GPU, as well as fbs/crtcs. Just in some cases, it **might** also be done via the GPU, if applicable (maybe only when it's idle anyway), but that's not the usual case. Instead the usual case would be employing some DMA controller or IPU.
One problem with 2d blitters is that there's no common userspace interface, but many: Xrender, hwc, old X drawing api, various attempts by khronos to standardize something, cairo, ... It's probably worse than video decoding even, and definitely not like on the 3d side where there's GL (and now vulkan) and that's it.
So you'll end up with tons of glue code everywhere anyway. Adding yet another kernel uapi doesn't help, but forcing it to be generic will make sure it's inefficient. Which means someone else then will create another one.
And please don't be discouraged here, I just want to set clear expectations to avoid disappointment. Supporting blitter hardware is obviously a good idea, and I think the drm subsystem is the right place for that (especially if you have a display block or sometimes a real gpu connected to that blitter).
Okay, where else should we put it? Invent an entirely new device for that?
If the blitter is always attached to the display block just add a few gem based ioctls there (like with desktop gpus) for submitting blit workloads. Otherwise new driver I guess.
Either case it'll probably be a bit more painful than a kms driver, since on the gem side the helpers aren't that full-featured (yet). -Daniel
Hi,
On 4 August 2016 at 08:50, Daniel Vetter daniel@ffwll.ch wrote:
One problem with 2d blitters is that there's no common userspace interface, but many: Xrender, hwc, old X drawing api, various attempts by khronos to standardize something, cairo, ... It's probably worse than video decoding even, and definitely not like on the 3d side where there's GL (and now vulkan) and that's it.
Running with the same theme, a unified API would only be meaningfully useful if you have unified userspace support. As soon as you hit the usual issues of needing to blit to/from special buffer types, weird format restrictions, chained operations which can affect performance enough to make you avoid or heavily favour certain types of operations, etc etc, you'll need separate userspace code to handle them. And at that point, sticking it behind a unified API doesn't really bring any value.
Other prior art you could look at is the Renesas VSP1/VSP2 hardware, which works through V4L2 and its media controller.
Cheers, Daniel
On 04.08.2016 09:50, Daniel Vetter wrote:
Hi,
One problem with 2d blitters is that there's no common userspace interface, but many: Xrender, hwc, old X drawing api, various attempts by khronos to standardize something, cairo, ...
We're talking about userland APIs, not kernel->userland interfaces. For userland APIs, I'm right now primarily interested in cairo (using it for my tiny widget toolkit) ... but I'm also thinking about setting X on top of something cairo-alike some day - or making gallium that layer.
It's probably worse than video decoding even, and definitely not like on the 3d side where there's GL (and now vulkan) and that's it.
On the video side we have v4l for the kernel interface and gst as the userland framework ... looks like a good compromise to me.
So you'll end up with tons of glue code everywhere anyway.
Actually, I'd like to get the glue code smaller. Putting both cairo and X onto the common driver base (something that's somewhere between xorg video drivers and cairo surface backends) seems a good way to go, even though there'll be a lot of work to do for that.
Adding yet another kernel uapi doesn't help, but forcing it to be generic will make sure it's inefficient. Which means someone else then will create another one.
hmm, I'm not yet convinced that it necessarily will be inefficient.
To clarify the scope: I'm talking only about _dedicated_ units, which are completely orthogonal to complex gpus (basically, just specialized dma controllers).
I personally don't care so much whether it's in DRM, V4L or whatever. DRM just seemed to be a good place to me.
By the way: as the number of such controllers increases, for dozens of different things, eg. IO, crypto, etc., and in many cases they're able to directly access the same memory, I got the feeling that we should generalize gems even further, so that they could be any kind of buffer that may be passed to any kind of device. (hmm, reminds me of some ancient mainframe concepts)
If the blitter is always attached to the display block just add a few gem based ioctls there (like with desktop gpus) for submitting blit workloads. Otherwise new driver I guess.
hmm, can I use gems outside DRM? eg. would it be possible to write a storage controller driver that directly accesses some gem (eg. let the controller write out a gem object)?
--mtx
On 05.08.2016 01:16, Enrico Weigelt, metux IT consult wrote:
<snip> Seems I've been on a completely wrong path - what I'm looking for is dma-buf. So my idea now goes like this:
* add a new 'virtual GPU' as render node.
* the basic operations are:
  -> create a virtual dumb framebuffer (just inside system memory)
  -> import dma-buf's as bo's
  -> blitting between bo's using dma-engine
That way, everything should be cleanly separated.
As the application needs to be aware of that buffer-and-blit approach anyway (IOW: allocate two BOs and trigger the blitting when it's done rendering), the extra glue needed for opening and talking to the render node should be quite minimal.
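The "only copy over the changed regions" bookkeeping on the application side could then be as simple as accumulating a dirty bounding box between blits - again just a sketch with made-up names, not a real API:

```c
#include <stdint.h>

struct rect {
	uint32_t x1, y1;  /* inclusive top-left */
	uint32_t x2, y2;  /* exclusive bottom-right */
};

/* grow the dirty bounding box to cover one more damaged rect */
static void damage_add(struct rect *dirty, const struct rect *r)
{
	if (dirty->x1 >= dirty->x2) {  /* empty so far: just take r */
		*dirty = *r;
		return;
	}
	if (r->x1 < dirty->x1) dirty->x1 = r->x1;
	if (r->y1 < dirty->y1) dirty->y1 = r->y1;
	if (r->x2 > dirty->x2) dirty->x2 = r->x2;
	if (r->y2 > dirty->y2) dirty->y2 = r->y2;
}
```

On each frame the application would render into the system-memory BO, hand the accumulated box to the blit ioctl, and reset it to empty.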
--mtx
On Fri, Aug 05, 2016 at 06:37:26AM +0200, Enrico Weigelt, metux IT consult wrote:
On 05.08.2016 01:16, Enrico Weigelt, metux IT consult wrote:
<snip> Seems I've been on a completely wrong path - what I'm looking for is dma-buf. So my idea now goes like this:
- add a new 'virtual GPU' as render node.
- the basic operations are:
  -> create a virtual dumb framebuffer (just inside system memory)
  -> import dma-buf's as bo's
  -> blitting between bo's using dma-engine
That way, everything should be cleanly separated.
As the application needs to be aware of that buffer-and-blit approach anyway (IOW: allocate two BOs and trigger the blitting when it's done rendering), the extra glue needed for opening and talking to the render node should be quite minimal.
Yup, this is pretty much what I've been suggesting ;-) The other bit is that pls don't try to make the IOCTL/uapi interfaces generic, it will hurt. Of course if there's a pile of IP (from the same vendor or whatever) that all works similarly then sure, a shared driver makes sense. But pretty soon it doesn't (usually right when you want to have something closer to direct submission to hardware with relocations). -Daniel
On Fri, Aug 05, 2016 at 01:16:55AM +0200, Enrico Weigelt, metux IT consult wrote:
On 04.08.2016 09:50, Daniel Vetter wrote:
Hi,
One problem with 2d blitters is that there's no common userspace interface, but many: Xrender, hwc, old X drawing api, various attempts by khronos to standardize something, cairo, ...
We're talking about userland APIs, not kernel->userland interfaces. For userland APIs, I'm right now primarily interested in cairo (using it for my tiny widget toolkit) ... but I'm also thinking about setting X on top of something cairo-alike some day - or making gallium that layer.
It's probably worse than video decoding even, and definitely not like on the 3d side where there's GL (and now vulkan) and that's it.
On the video side we have v4l for the kernel interface and gst as the userland framework ... looks like a good compromise to me.
So you'll end up with tons of glue code everywhere anyway.
Actually, I'd like to get the glue code smaller. Putting both cairo and X onto the common driver base (something that's somewhere between xorg video drivers and cairo surface backends) seems a good way to go, even though there'll be a lot of work to do for that.
Adding yet another kernel uapi doesn't help, but forcing it to be generic will make sure it's inefficient. Which means someone else then will create another one.
hmm, I'm not yet convinced that it necessarily will be inefficient.
To clarify the scope: I'm talking only about _dedicated_ units, which are completely orthogonal to complex gpus (basically, just specialized dma controllers).
I personally don't care so much whether it's in DRM, V4L or whatever. DRM just seemed to be a good place to me.
By the way: as the number of such controllers increases, for dozens of different things, eg. IO, crypto, etc., and in many cases they're able to directly access the same memory, I got the feeling that we should generalize gems even further, so that they could be any kind of buffer that may be passed to any kind of device. (hmm, reminds me of some ancient mainframe concepts)
If the blitter is always attached to the display block just add a few gem based ioctls there (like with desktop gpus) for submitting blit workloads. Otherwise new driver I guess.
hmm, can I use gems outside DRM? eg. would it be possible to write a storage controller driver that directly accesses some gem (eg. let the controller write out a gem object)?
Of course. In drm you can export/import gem buffers from/to dma-buf. See
https://dri.freedesktop.org/docs/drm/gpu/drm-mm.html#prime-buffer-sharing
Cheers, Daniel
On 03.08.2016 11:24, Marek Szyprowski wrote:
Hi,
I'm working now on something similar, but more generic. There is already a framework for picture processing (converting, scaling, blitting, rotating) in Exynos DRM.
In DRM, not v4l? Hmm, interesting.
On mx5/mx6 we've got an IPU, which is accessible via v4l, eg. for colorspace conversion, jpeg encode/decode, rotation, etc. (anyone of the involved folks @ptx here on the list?)
Yet another overlap between DRM and V4L (IMHO, it seems to be a matter of perspective and usecases where to put such stuff)
By the way: what's the status of sharing buffers between DRM and V4L? I could also live with having such a hw-based image-copy operation living within v4l, when they're operating on the same buffers.
http://thread.gmane.org/gmane.linux.kernel.samsung-soc/49743
seems to be offline
I plan to propose an API based on DRM object/properties, which will be similar to KMS atomic API. I will let you know when I have it ready for presenting in public.
hmm, I'm getting curious ...
--mtx