Hi guys,
we are currently working on FreeSync and direct scanout from system memory on AMD APUs in A+A laptops.
One problem we stumbled over is that our display hardware needs to scan out from uncached system memory, and we currently don't have a way to communicate that through DMA-buf.
For our specific use case at hand we are going to implement something driver-specific, but the question is: should we have something more generic for this?
After all, the system memory access pattern is a PCIe extension and as such something generic.
Regards, Christian.
On Monday, February 15th, 2021 at 9:58 AM, Christian König christian.koenig@amd.com wrote:
we are currently working on FreeSync and direct scanout from system memory on AMD APUs in A+A laptops.
One problem we stumbled over is that our display hardware needs to scan out from uncached system memory, and we currently don't have a way to communicate that through DMA-buf.
For our specific use case at hand we are going to implement something driver-specific, but the question is: should we have something more generic for this?
After all, the system memory access pattern is a PCIe extension and as such something generic.
Intel also needs uncached system memory if I'm not mistaken?
Where are the buffers allocated? If GBM, then it needs to allocate memory that can be scanned out if the USE_SCANOUT flag is set or if a scanout-capable modifier is picked.
If this is about communicating buffer constraints between different components of the stack, there were a few proposals about it. The most recent one is [1].
Simon
Am 15.02.21 um 10:06 schrieb Simon Ser:
On Monday, February 15th, 2021 at 9:58 AM, Christian König christian.koenig@amd.com wrote:
we are currently working on FreeSync and direct scanout from system memory on AMD APUs in A+A laptops.
One problem we stumbled over is that our display hardware needs to scan out from uncached system memory, and we currently don't have a way to communicate that through DMA-buf.
For our specific use case at hand we are going to implement something driver-specific, but the question is: should we have something more generic for this?
After all, the system memory access pattern is a PCIe extension and as such something generic.
Intel also needs uncached system memory if I'm not mistaken?
No idea, that's why I'm asking. Could be that this is also interesting for I+A systems.
Where are the buffers allocated? If GBM, then it needs to allocate memory that can be scanned out if the USE_SCANOUT flag is set or if a scanout-capable modifier is picked.
If this is about communicating buffer constraints between different components of the stack, there were a few proposals about it. The most recent one is [1].
Well the problem here is on a different level of the stack.
Resolution, pitch, etc. can easily be communicated in userspace without involving the kernel. The worst thing that can happen is that you draw garbage into your own application window.
But if you get the caching attributes in the page tables (CPU as well as IOMMU, device, etc.) wrong, then ARM, for example, has a tendency to just spontaneously reboot.
x86 is fortunately a bit more graceful and you only end up with random data corruption, but that is only marginally better.
So to sum it up, that is not something we can leave in the hands of userspace.
I think that exporters in the DMA-buf framework should have the ability to tell importers whether system memory snooping is necessary or not.
Userspace components can then of course tell the exporter what the importer needs, but validation that this is correct and doesn't crash the system must happen in the kernel.
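As a rough illustration of that split (all names here are hypothetical; this is not the actual DMA-buf API), the kernel-side validation could look roughly like this:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model, not the real DMA-buf API: the exporter states whether
 * the backing memory is cache-coherent (snooped); the importer states
 * whether it must read with no-snoop for latency reasons. */
struct exporter_caps {
	bool provides_coherent; /* buffer backed by snooped, cacheable memory */
};

struct importer_reqs {
	bool needs_no_snoop; /* e.g. a display engine that cannot afford snoop latency */
};

/* Kernel-side validation: userspace may request whatever it wants, but the
 * kernel rejects combinations that would let a no-snoop reader see dirty
 * cache lines (garbage on screen, or a machine check on some systems). */
static bool attach_allowed(const struct exporter_caps *ex,
			   const struct importer_reqs *im)
{
	if (im->needs_no_snoop && ex->provides_coherent)
		return false;
	return true;
}
```

Userspace would still pick the allocation, but a mismatch like "no-snoop importer on cached memory" is refused in the kernel rather than trusted.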
Regards, Christian.
Simon
Am Montag, dem 15.02.2021 um 10:34 +0100 schrieb Christian König:
Am 15.02.21 um 10:06 schrieb Simon Ser:
On Monday, February 15th, 2021 at 9:58 AM, Christian König christian.koenig@amd.com wrote:
we are currently working on FreeSync and direct scanout from system memory on AMD APUs in A+A laptops.
One problem we stumbled over is that our display hardware needs to scan out from uncached system memory, and we currently don't have a way to communicate that through DMA-buf.
For our specific use case at hand we are going to implement something driver-specific, but the question is: should we have something more generic for this?
After all, the system memory access pattern is a PCIe extension and as such something generic.
Intel also needs uncached system memory if I'm not mistaken?
No idea, that's why I'm asking. Could be that this is also interesting for I+A systems.
Where are the buffers allocated? If GBM, then it needs to allocate memory that can be scanned out if the USE_SCANOUT flag is set or if a scanout-capable modifier is picked.
If this is about communicating buffer constraints between different components of the stack, there were a few proposals about it. The most recent one is [1].
Well the problem here is on a different level of the stack.
Resolution, pitch, etc. can easily be communicated in userspace without involving the kernel. The worst thing that can happen is that you draw garbage into your own application window.
But if you get the caching attributes in the page tables (CPU as well as IOMMU, device, etc.) wrong, then ARM, for example, has a tendency to just spontaneously reboot.
x86 is fortunately a bit more graceful and you only end up with random data corruption, but that is only marginally better.
So to sum it up, that is not something we can leave in the hands of userspace.
I think that exporters in the DMA-buf framework should have the ability to tell importers whether system memory snooping is necessary or not.
There is already a coarse-grained way to do so: the dma_coherent property in struct device, which you can check at dmabuf attach time.
However it may not be enough for the requirements of a GPU where the engines could differ in their dma coherency requirements. For that you need to either have fake struct devices for the individual engines or come up with a more fine-grained way to communicate those requirements.
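To illustrate why the device-level bit can be too coarse (a toy model, not kernel code): a single struct device carries one dma_coherent flag, but two engines behind that device can disagree about their requirements, and an attach-time check sees only the device:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the coarse-grained approach: the importer's whole device
 * carries one dma_coherent bit, much like struct device in the kernel. */
struct toy_device {
	bool dma_coherent;
};

struct toy_engine {
	const struct toy_device *dev;
	bool wants_no_snoop; /* per-engine requirement the device bit can't express */
};

/* An attach-time check can only consult the device-level flag... */
static bool attach_sees_coherent(const struct toy_engine *e)
{
	return e->dev->dma_coherent;
}

/* ...so a scanout engine needing no-snoop and a render engine that is fine
 * with snooping get exactly the same answer. */
```

Hence the suggestion of either fake struct devices per engine or a finer-grained way to express the requirement.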
Userspace components can then of course tell the exporter what the importer needs, but validation that this is correct and doesn't crash the system must happen in the kernel.
What exactly do you mean by "scanout requires non-coherent memory"? Does the scanout requestor always set the no-snoop PCI flag, so you get garbage if some writes to memory are still stuck in the caches, or is it some other requirement?
Regards, Lucas
Am 15.02.21 um 12:53 schrieb Lucas Stach:
Am Montag, dem 15.02.2021 um 10:34 +0100 schrieb Christian König:
Am 15.02.21 um 10:06 schrieb Simon Ser:
On Monday, February 15th, 2021 at 9:58 AM, Christian König christian.koenig@amd.com wrote:
we are currently working on FreeSync and direct scanout from system memory on AMD APUs in A+A laptops.
One problem we stumbled over is that our display hardware needs to scan out from uncached system memory, and we currently don't have a way to communicate that through DMA-buf.
For our specific use case at hand we are going to implement something driver-specific, but the question is: should we have something more generic for this?
After all, the system memory access pattern is a PCIe extension and as such something generic.
Intel also needs uncached system memory if I'm not mistaken?
No idea, that's why I'm asking. Could be that this is also interesting for I+A systems.
Where are the buffers allocated? If GBM, then it needs to allocate memory that can be scanned out if the USE_SCANOUT flag is set or if a scanout-capable modifier is picked.
If this is about communicating buffer constraints between different components of the stack, there were a few proposals about it. The most recent one is [1].
Well the problem here is on a different level of the stack.
Resolution, pitch, etc. can easily be communicated in userspace without involving the kernel. The worst thing that can happen is that you draw garbage into your own application window.
But if you get the caching attributes in the page tables (CPU as well as IOMMU, device, etc.) wrong, then ARM, for example, has a tendency to just spontaneously reboot.
x86 is fortunately a bit more graceful and you only end up with random data corruption, but that is only marginally better.
So to sum it up, that is not something we can leave in the hands of userspace.
I think that exporters in the DMA-buf framework should have the ability to tell importers whether system memory snooping is necessary or not.
There is already a coarse-grained way to do so: the dma_coherent property in struct device, which you can check at dmabuf attach time.
However it may not be enough for the requirements of a GPU where the engines could differ in their dma coherency requirements. For that you need to either have fake struct devices for the individual engines or come up with a more fine-grained way to communicate those requirements.
Yeah, that won't work. We need this on a per buffer level.
Userspace components can then of course tell the exporter what the importer needs, but validation that this is correct and doesn't crash the system must happen in the kernel.
What exactly do you mean by "scanout requires non-coherent memory"? Does the scanout requestor always set the no-snoop PCI flag, so you get garbage if some writes to memory are still stuck in the caches, or is it some other requirement?
Snooping the CPU caches introduces some extra latency, so what can happen is that the response to the PCIe read comes too late for the scanout. The result is an underflow and flickering whenever something that needs to be flushed first is in the cache.
On the other hand, when we don't snoop the CPU caches we at least get garbage/stale data on the screen. That wouldn't be so bad, but the big problem is that we have also seen machine check exceptions when we don't snoop and the cache is dirty.
So this had better be coherent or you can crash the box. ARM seems to be really susceptible to this; x86 is fortunately much more graceful, and I'm not sure about other architectures.
Regards, Christian.
Regards, Lucas
Am Montag, dem 15.02.2021 um 13:04 +0100 schrieb Christian König:
Am 15.02.21 um 12:53 schrieb Lucas Stach:
Am Montag, dem 15.02.2021 um 10:34 +0100 schrieb Christian König:
Am 15.02.21 um 10:06 schrieb Simon Ser:
On Monday, February 15th, 2021 at 9:58 AM, Christian König christian.koenig@amd.com wrote:
we are currently working on FreeSync and direct scanout from system memory on AMD APUs in A+A laptops.
One problem we stumbled over is that our display hardware needs to scan out from uncached system memory, and we currently don't have a way to communicate that through DMA-buf.
For our specific use case at hand we are going to implement something driver-specific, but the question is: should we have something more generic for this?
After all, the system memory access pattern is a PCIe extension and as such something generic.
Intel also needs uncached system memory if I'm not mistaken?
No idea, that's why I'm asking. Could be that this is also interesting for I+A systems.
Where are the buffers allocated? If GBM, then it needs to allocate memory that can be scanned out if the USE_SCANOUT flag is set or if a scanout-capable modifier is picked.
If this is about communicating buffer constraints between different components of the stack, there were a few proposals about it. The most recent one is [1].
Well the problem here is on a different level of the stack.
Resolution, pitch, etc. can easily be communicated in userspace without involving the kernel. The worst thing that can happen is that you draw garbage into your own application window.
But if you get the caching attributes in the page tables (CPU as well as IOMMU, device, etc.) wrong, then ARM, for example, has a tendency to just spontaneously reboot.
x86 is fortunately a bit more graceful and you only end up with random data corruption, but that is only marginally better.
So to sum it up, that is not something we can leave in the hands of userspace.
I think that exporters in the DMA-buf framework should have the ability to tell importers whether system memory snooping is necessary or not.
There is already a coarse-grained way to do so: the dma_coherent property in struct device, which you can check at dmabuf attach time.
However it may not be enough for the requirements of a GPU where the engines could differ in their dma coherency requirements. For that you need to either have fake struct devices for the individual engines or come up with a more fine-grained way to communicate those requirements.
Yeah, that won't work. We need this on a per buffer level.
Userspace components can then of course tell the exporter what the importer needs, but validation that this is correct and doesn't crash the system must happen in the kernel.
What exactly do you mean by "scanout requires non-coherent memory"? Does the scanout requestor always set the no-snoop PCI flag, so you get garbage if some writes to memory are still stuck in the caches, or is it some other requirement?
Snooping the CPU caches introduces some extra latency, so what can happen is that the response to the PCIe read comes too late for the scanout. The result is an underflow and flickering whenever something that needs to be flushed first is in the cache.
Okay, that confirms my theory on why this is needed. So things don't totally explode if you don't do it, but in order to guarantee access latency you need to take the no-snoop path, which means your device effectively becomes DMA non-coherent.
On the other hand, when we don't snoop the CPU caches we at least get garbage/stale data on the screen. That wouldn't be so bad, but the big problem is that we have also seen machine check exceptions when we don't snoop and the cache is dirty.
If you attach to the dma-buf with a struct device which is non-coherent, it's the exporter's job to flush any dirty caches. Unfortunately the caching of dma-buf attachments in the DRM framework will get a bit in the way here, so a DRM-specific flush might be needed. :/ Maybe moving the whole buffer to an uncached sysmem location on first attach of a non-coherent importer would be enough?
So this had better be coherent or you can crash the box. ARM seems to be really susceptible to this; x86 is fortunately much more graceful, and I'm not sure about other architectures.
ARM really dislikes page table setups with different attributes pointing to the same physical page; however, you should be fine as long as all cached aliases are properly flushed from the cache before access via a different alias.
Regards, Lucas
Am 15.02.21 um 13:16 schrieb Lucas Stach:
[SNIP]
Userspace components can then of course tell the exporter what the importer needs, but validation that this is correct and doesn't crash the system must happen in the kernel.
What exactly do you mean by "scanout requires non-coherent memory"? Does the scanout requestor always set the no-snoop PCI flag, so you get garbage if some writes to memory are still stuck in the caches, or is it some other requirement?
Snooping the CPU caches introduces some extra latency, so what can happen is that the response to the PCIe read comes too late for the scanout. The result is an underflow and flickering whenever something that needs to be flushed first is in the cache.
Okay, that confirms my theory on why this is needed. So things don't totally explode if you don't do it, but in order to guarantee access latency you need to take the no-snoop path, which means your device effectively becomes DMA non-coherent.
Exactly. My big question at the moment is whether this is something AMD-specific or whether we have the same issue on other devices as well.
On the other hand, when we don't snoop the CPU caches we at least get garbage/stale data on the screen. That wouldn't be so bad, but the big problem is that we have also seen machine check exceptions when we don't snoop and the cache is dirty.
If you attach to the dma-buf with a struct device which is non-coherent, it's the exporter's job to flush any dirty caches. Unfortunately the caching of dma-buf attachments in the DRM framework will get a bit in the way here, so a DRM-specific flush might be needed. :/ Maybe moving the whole buffer to an uncached sysmem location on first attach of a non-coherent importer would be enough?
Could work in theory, but the problem is that to do this I would have to tear down all CPU mappings and attachments of other devices.
Apart from the problem that we don't have the infrastructure for that, we also don't know at import time that a buffer might be used for scanout. I would need to re-import it during fb creation or something like this.
Our current concept for AMD GPUs is rather to try to use uncached memory as much as possible. So for the specific use case, just checking whether the exporter is amdgpu and has the flag set should be enough for now.
So this had better be coherent or you can crash the box. ARM seems to be really susceptible to this; x86 is fortunately much more graceful, and I'm not sure about other architectures.
ARM really dislikes page table setups with different attributes pointing to the same physical page; however, you should be fine as long as all cached aliases are properly flushed from the cache before access via a different alias.
Yeah, can totally confirm that and had to learn it the hard way.
Regards, Christian.
Regards, Lucas
From: Christian König
Sent: 15 February 2021 12:05
...
Snooping the CPU caches introduces some extra latency, so what can happen is that the response to the PCIe read comes too late for the scanout. The result is an underflow and flickering whenever something that needs to be flushed first is in the cache.
Aren't you going to get the same problem if any other endpoints are doing memory reads? Possibly even ones that don't require a cache snoop and flush.
What about just the cpu doing a real memory transfer?
Or a combination of the two above happening just before your request.
If you don't have a big enough fifo you'll lose.
I did 'fix' a similar(ish) issue with video DMA latency on an embedded system based on the SA1100/SA1101 by significantly reducing the clock to the VGA panel whenever the CPU was doing 'slow io'. (Interleaving an uncached CPU DRAM write between the slow io cycles also fixed it.) But the video was the only DMA device, and that was an embedded system. Given that the application note about video latency didn't mention what was actually happening, I'm not sure how many people actually got it working!
David
- Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
Am 15.02.21 um 15:41 schrieb David Laight:
From: Christian König
Sent: 15 February 2021 12:05
...
Snooping the CPU caches introduces some extra latency, so what can happen is that the response to the PCIe read comes too late for the scanout. The result is an underflow and flickering whenever something that needs to be flushed first is in the cache.
Aren't you going to get the same problem if any other endpoints are doing memory reads?
The PCIe device in this case is part of the SoC, so we have a high-priority channel to memory.
Because of this, the hardware designers assumed they had guaranteed memory latency.
Possibly even ones that don't require a cache snoop and flush.
What about just the cpu doing a real memory transfer?
Or a combination of the two above happening just before your request.
If you don't have a big enough fifo you'll lose.
I did 'fix' a similar(ish) issue with video DMA latency on an embedded system based on the SA1100/SA1101 by significantly reducing the clock to the VGA panel whenever the CPU was doing 'slow io'. (Interleaving an uncached CPU DRAM write between the slow io cycles also fixed it.) But the video was the only DMA device, and that was an embedded system. Given that the application note about video latency didn't mention what was actually happening, I'm not sure how many people actually got it working!
Yeah, I'm also not sure whether AMD will solve this with deeper FIFOs or more prefetching in future designs.
But you gave me at least one example where somebody had similar problems.
Thanks for the feedback, Christian.
David
_______________________________________________
Linaro-mm-sig mailing list
Linaro-mm-sig@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/linaro-mm-sig
Hi
Am 15.02.21 um 09:58 schrieb Christian König:
Hi guys,
we are currently working on FreeSync and direct scanout from system memory on AMD APUs in A+A laptops.
One problem we stumbled over is that our display hardware needs to scan out from uncached system memory, and we currently don't have a way to communicate that through DMA-buf.
For our specific use case at hand we are going to implement something driver-specific, but the question is: should we have something more generic for this?
For vmap operations, we return the address as struct dma_buf_map, which contains additional information about the memory buffer. In vram helpers, we have the interface drm_gem_vram_offset() that returns the offset of the GPU device memory.
Would it be feasible to combine both concepts into a dma-buf interface that returns the device-memory offset plus the additional caching flag?
There'd be a structure and a getter function returning the structure.
struct dma_buf_offset {
	bool cached;
	u64 address;
};

/* return the offset in *off */
int dma_buf_offset(struct dma_buf *buf, struct dma_buf_offset *off);
Whatever settings are returned by dma_buf_offset() are valid while the dma_buf is pinned.
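A self-contained mock of the proposed interface (the dma_buf internals here are faked purely for illustration; only struct dma_buf_offset comes from the proposal) could behave like this:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t u64;

/* The structure from the proposal. */
struct dma_buf_offset {
	bool cached;
	u64 address;
};

/* Mock dma_buf: the real object is far richer; these fields just stand in
 * for "is the buffer pinned" and "where/how is it backed". */
struct mock_dma_buf {
	bool pinned;
	bool backing_cached;
	u64 backing_address;
};

/* Getter returning the device-memory offset plus the caching flag in *off.
 * The returned settings are only guaranteed while the buffer stays pinned. */
static int mock_dma_buf_offset(const struct mock_dma_buf *buf,
			       struct dma_buf_offset *off)
{
	if (!buf->pinned)
		return -1; /* refuse: settings could change under us */
	off->cached = buf->backing_cached;
	off->address = buf->backing_address;
	return 0;
}
```

An importer would pin the buffer, query the offset and caching flag once, and treat both as stable until it unpins.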
Best regards Thomas
After all, the system memory access pattern is a PCIe extension and as such something generic.
Regards, Christian.

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel
Hi
Am 15.02.21 um 10:49 schrieb Thomas Zimmermann:
Hi
Am 15.02.21 um 09:58 schrieb Christian König:
Hi guys,
we are currently working on FreeSync and direct scanout from system memory on AMD APUs in A+A laptops.
One problem we stumbled over is that our display hardware needs to scan out from uncached system memory, and we currently don't have a way to communicate that through DMA-buf.
Re-reading this paragraph, it sounds more as if you want to let the exporter know where to move the buffer. Is this another case of the missing-pin-flag problem?
Best regards Thomas
For our specific use case at hand we are going to implement something driver-specific, but the question is: should we have something more generic for this?
For vmap operations, we return the address as struct dma_buf_map, which contains additional information about the memory buffer. In vram helpers, we have the interface drm_gem_vram_offset() that returns the offset of the GPU device memory.
Would it be feasible to combine both concepts into a dma-buf interface that returns the device-memory offset plus the additional caching flag?
There'd be a structure and a getter function returning the structure.
struct dma_buf_offset {
	bool cached;
	u64 address;
};

/* return the offset in *off */
int dma_buf_offset(struct dma_buf *buf, struct dma_buf_offset *off);
Whatever settings are returned by dma_buf_offset() are valid while the dma_buf is pinned.
Best regards Thomas
After all, the system memory access pattern is a PCIe extension and as such something generic.
Regards, Christian.
Am 15.02.21 um 13:00 schrieb Thomas Zimmermann:
Hi
Am 15.02.21 um 10:49 schrieb Thomas Zimmermann:
Hi
Am 15.02.21 um 09:58 schrieb Christian König:
Hi guys,
we are currently working on FreeSync and direct scanout from system memory on AMD APUs in A+A laptops.
One problem we stumbled over is that our display hardware needs to scan out from uncached system memory, and we currently don't have a way to communicate that through DMA-buf.
Re-reading this paragraph, it sounds more as if you want to let the exporter know where to move the buffer. Is this another case of the missing-pin-flag problem?
No, your original interpretation was correct. Maybe my writing is a bit unspecific.
The real underlying issue is that our display hardware has a problem with latency when accessing system memory.
So the question is whether that also applies to, for example, Intel hardware or other devices as well, or whether it is just something AMD-specific.
Regards, Christian.
Best regards Thomas
For our specific use case at hand we are going to implement something driver-specific, but the question is: should we have something more generic for this?
For vmap operations, we return the address as struct dma_buf_map, which contains additional information about the memory buffer. In vram helpers, we have the interface drm_gem_vram_offset() that returns the offset of the GPU device memory.
Would it be feasible to combine both concepts into a dma-buf interface that returns the device-memory offset plus the additional caching flag?
There'd be a structure and a getter function returning the structure.
struct dma_buf_offset {
	bool cached;
	u64 address;
};

/* return the offset in *off */
int dma_buf_offset(struct dma_buf *buf, struct dma_buf_offset *off);
Whatever settings are returned by dma_buf_offset() are valid while the dma_buf is pinned.
Best regards Thomas
After all, the system memory access pattern is a PCIe extension and as such something generic.
Regards, Christian.
Le lundi 15 février 2021 à 13:10 +0100, Christian König a écrit :
Am 15.02.21 um 13:00 schrieb Thomas Zimmermann:
Hi
Am 15.02.21 um 10:49 schrieb Thomas Zimmermann:
Hi
Am 15.02.21 um 09:58 schrieb Christian König:
Hi guys,
we are currently working on FreeSync and direct scanout from system memory on AMD APUs in A+A laptops.
One problem we stumbled over is that our display hardware needs to scan out from uncached system memory, and we currently don't have a way to communicate that through DMA-buf.
Re-reading this paragraph, it sounds more as if you want to let the exporter know where to move the buffer. Is this another case of the missing-pin-flag problem?
No, your original interpretation was correct. Maybe my writing is a bit unspecific.
The real underlying issue is that our display hardware has a problem with latency when accessing system memory.
So the question is whether that also applies to, for example, Intel hardware or other devices as well, or whether it is just something AMD-specific.
I do believe that the answer is yes: Intel display hardware has a similar issue with latency, hence requires uncached memory.
Regards, Christian.
Best regards Thomas
For our specific use case at hand we are going to implement something driver-specific, but the question is: should we have something more generic for this?
For vmap operations, we return the address as struct dma_buf_map, which contains additional information about the memory buffer. In vram helpers, we have the interface drm_gem_vram_offset() that returns the offset of the GPU device memory.
Would it be feasible to combine both concepts into a dma-buf interface that returns the device-memory offset plus the additional caching flag?
There'd be a structure and a getter function returning the structure.
struct dma_buf_offset {
	bool cached;
	u64 address;
};

/* return the offset in *off */
int dma_buf_offset(struct dma_buf *buf, struct dma_buf_offset *off);
Whatever settings are returned by dma_buf_offset() are valid while the dma_buf is pinned.
Best regards Thomas
After all, the system memory access pattern is a PCIe extension and as such something generic.
Regards, Christian.
Le lundi 15 février 2021 à 09:58 +0100, Christian König a écrit :
Hi guys,
we are currently working on FreeSync and direct scanout from system memory on AMD APUs in A+A laptops.
One problem we stumbled over is that our display hardware needs to scan out from uncached system memory, and we currently don't have a way to communicate that through DMA-buf.
For our specific use case at hand we are going to implement something driver-specific, but the question is: should we have something more generic for this?
Hopefully I'm getting this right, but this makes me think of a long-standing issue I've run into with the Intel DRM and UVC drivers. If I let the UVC driver allocate the buffer and import the resulting DMABuf (cacheable memory, written with a CPU copy in the kernel) into DRM, we can see cache artifacts being displayed. Whereas if I use the DRM driver's memory (a dumb buffer in that case) it's clean, because there is a driver-specific solution to that.
There is no obvious way for a userspace application to know what the right/wrong way is, and in fact it feels like the kernel could solve this somehow without having to inform userspace (perhaps).
After all, the system memory access pattern is a PCIe extension and as such something generic.
Regards, Christian.
On 2/16/21 4:39 AM, Nicolas Dufresne wrote:
Le lundi 15 février 2021 à 09:58 +0100, Christian König a écrit :
Hi guys,
we are currently working on FreeSync and direct scanout from system memory on AMD APUs in A+A laptops.
One problem we stumbled over is that our display hardware needs to scan out from uncached system memory, and we currently don't have a way to communicate that through DMA-buf.
For our specific use case at hand we are going to implement something driver-specific, but the question is: should we have something more generic for this?
Hopefully I'm getting this right, but this makes me think of a long-standing issue I've run into with the Intel DRM and UVC drivers. If I let the UVC driver allocate the buffer and import the resulting DMABuf (cacheable memory, written with a CPU copy in the kernel) into DRM, we can see cache artifacts being displayed. Whereas if I use the DRM driver's memory (a dumb buffer in that case) it's clean, because there is a driver-specific solution to that.
There is no obvious way for a userspace application to know what the right/wrong way is, and in fact it feels like the kernel could solve this somehow without having to inform userspace (perhaps).
After all, the system memory access pattern is a PCIe extension and as such something generic.
Regards, Christian.
Hi All,
We also encountered the UVC cache issue on an ARMv8 CPU in a Mediatek SoC when using UVC dmabuf export and feeding the dmabuf to the DRM display with the following GStreamer command:
# gst-launch-1.0 v4l2src device=/dev/video0 io-mode=dmabuf ! kmssink
The UVC driver uses videobuf2-vmalloc to allocate buffers and is able to export them as dmabufs. But UVC uses memcpy() to fill the frame buffer on the CPU without flushing the cache. So if the display hardware uses the buffer directly, the image shown on the screen will be dirty.
Here are some experiments:
1. By doing some memory operations (e.g. devmem) while streaming the UVC, the issue is mitigated. I guess the cache is swapped out rapidly.
2. By replacing the memcpy() with memcpy_flushcache() in the UVC driver, the issue disappears.
3. By adding a .finish callback in videobuf2-vmalloc.c to flush the cache before returning the buffer, the issue disappears.
It seems a cache flush stage is missing in either UVC or the display path. We may also need communication between the producer and consumer; then they can decide who is responsible for flushing, to avoid flushing the cache unconditionally and hurting performance.
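The effect of the missing flush can be modelled with a toy "cache" and "memory" (purely illustrative, not the actual videobuf2 code): the producer's memcpy() lands in the cache, and a vb2-style .finish flush is what makes the data visible to a non-snooping display.

```c
#include <assert.h>
#include <string.h>

/* Toy model of the missing flush: "memory" is what a non-snooping display
 * engine reads; "cache" is where plain CPU writes land first. */
static char memory[16];
static char cache[16];

/* Like UVC filling the frame buffer with memcpy(): data stays in the cache. */
static void producer_fill(const char *frame)
{
	strncpy(cache, frame, sizeof(cache) - 1);
}

/* Like a videobuf2 .finish callback (or using memcpy_flushcache): write the
 * dirty cache lines back so the display engine sees the real frame. */
static void buf_finish(void)
{
	memcpy(memory, cache, sizeof(memory));
}
```

Without buf_finish() the display reads whatever was previously in memory, which is exactly the "dirty image" symptom described above.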
Regards, Andy Hsieh
Hi Andy,
Am 21.06.22 um 12:17 schrieb Andy.Hsieh:
Well, that's not what this mail thread was all about.
The issue you are facing is that somebody is forgetting to flush caches, but the issue discussed in this thread here is that we have hardware which bypasses caches altogether.
As far as I can see in your case UVC just allocates normal cached system memory through videobuf2-vmalloc() and it is perfectly valid to fill that using memcpy().
If some hardware then accesses those buffers bypassing CPU caches then it is the responsibility of the importing driver and/or DMA subsystem to flush the caches accordingly.
Regards, Christian.
Regards, Andy Hsieh
************* MEDIATEK Confidentiality Notice ******************** The information contained in this e-mail message (including any attachments) may be confidential, proprietary, privileged, or otherwise exempt from disclosure under applicable laws. It is intended to be conveyed only to the designated recipient(s). Any use, dissemination, distribution, printing, retaining or copying of this e-mail (including its attachments) by unintended recipient(s) is strictly prohibited and may be unlawful. If you are not an intended recipient of this e-mail, or believe that you have received this e-mail in error, please notify the sender immediately (by replying to this e-mail), delete any and all copies of this e-mail (including any attachments) from your system, and do not disclose the content of this e-mail to any other person. Thank you!
Hi Christian and Andy,
Le mardi 21 juin 2022 à 12:34 +0200, Christian König a écrit :
Well, that's not what this mail thread was all about.
The issue you are facing is that somebody is forgetting to flush caches, but the issue discussed in this thread here is that we have hardware which bypasses caches altogether.
As far as I can see in your case UVC just allocates normal cached system memory through videobuf2-vmalloc() and it is perfectly valid to fill that using memcpy().
If some hardware then accesses those buffers bypassing CPU caches then it is the responsibility of the importing driver and/or DMA subsystem to flush the caches accordingly.
I've tracked this down to videobuf2-vmalloc.c failing to check for coherency during attach(). It is also missing a begin/end CPU access implementation for the case where it gets attached to a non-coherent device. Seems fixable, though; I'm far from an expert, more some random person reading code and comments.
regards, Nicolas
Am 21.06.22 um 17:42 schrieb Nicolas Dufresne:
I've tracked this down to videobuf2-vmalloc.c failing to check for coherency during attach(). It is also missing a begin/end CPU access implementation for the case where it gets attached to a non-coherent device. Seems fixable, though; I'm far from an expert, more some random person reading code and comments.
Well, that is perfectly expected behavior; videobuf2-vmalloc returns normal cached system memory.
So it doesn't care about the coherency of the buffer.
What should happen instead is that the display device needs to make sure that it can coherently access the data and that's not the case here.
Regards, Christian.
On Mon, Feb 15, 2021 at 09:58:08AM +0100, Christian König wrote:
Hi guys,
we are currently working on FreeSync and direct scanout from system memory on AMD APUs in A+A laptops.
One problem we stumbled over is that our display hardware needs to scan out from uncached system memory and we currently don't have a way to communicate that through DMA-buf.
For our specific use case at hand we are going to implement something driver specific, but the question is should we have something more generic for this?
After all the system memory access pattern is a PCIe extension and as such something generic.
Yes it's a problem, and it's a complete mess. So the de facto rules are:
1. The importer has to do coherent transactions to snoop CPU caches.
This way both modes work:
- if the buffer is cached, we're fine
- if the buffer is not cached, but the exporter has flushed all the caches, you're mostly just wasting time on inefficient bus cycles. Also, this doesn't work if your CPU doesn't just drop clean cachelines, like I think some ARM cores are prone to do; no idea about AMD. Intel is guaranteed to drop them, which is why the uncached scanout for integrated Intel + AMD discrete "works".
2. The exporter picks the mode freely, and can even change it at runtime (i915 does this, since we don't have an "allocate for scanout" flag wired through consistently). This doesn't work on ARM; there the rule is "all devices in the same system must use the same mode".
3. This should be solved at the dma-buf layer, but the dma-api refuses to tell you this information (at least for dma_alloc_coherent). And I'm not going to deal with the bikeshed that would bring into my inbox. Or at least there's always been screaming that drivers shouldn't peek behind the abstraction.
So I think if AMD also guarantees to drop clean cachelines, just do the same thing we do right now for Intel integrated + discrete AMD, but in reverse. It's fragile, but it does work.
What we imo shouldn't do is driver-private interfaces here, that's just going to make the problem worse long term. Or at least no driver-private interfaces that span across drivers behind dma-buf, because imo this is really a problem that dma-buf should solve.
If you do want to solve this at the dma-buf level I can try and point you at the respective i915 and amdgpu code that makes the magic work - I've had to fix it a few times in the past. I'm not sure whether we'd need to pass the dynamic nature through though, i.e. whether we want to be able to scan out imported dma-buf and hence request they be used in uncached mode. -Daniel
Le mardi 16 février 2021 à 10:25 +0100, Daniel Vetter a écrit :
So I think if AMD also guarantees to drop clean cachelines, just do the same thing we do right now for Intel integrated + discrete AMD, but in reverse. It's fragile, but it does work.
Sorry to disrupt, but if you pass V4L2 vmalloc data to Intel display driver, you also get nice dirt on the screen. If you have a UVC webcam that produces a pixel format compatible with your display, you can reproduce the issue quite easily with:
gst-launch-1.0 v4l2src device=/dev/video0 ! kmssink
p.s. some frame rates are less likely to exhibit the issue; make sure you create movement to see it.
The only solution I could think of (not implemented) was to detect in the attach() call what the importers can do (with dev->coherent_dma_mask, if I recall), and otherwise flush the cache immediately and keep flushing the cache from then on, signalling it for DQBUF (in the vb2 workqueue or the dqbuf ioctl, I don't have an idea yet). I bet this idea is inapplicable where you have fences; we don't have that in V4L2.
This idea was hinted at by Robert Becket (now in CC), but perhaps I picked it up wrong, am explaining it wrong, etc. I'm no expert, just noticed there wasn't really a good plan for that, so one needs to make one up. I'm not aware of how an importer could know how the memory was allocated by the exporter, and worse, how an importer could figure out that the exporter is going to produce buffers with a hot CPU cache (the UVC driver does memcpy from USB chunks of variable size to produce a fixed-size image).
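A minimal sketch of that attach-time detection, assuming the kernel-internal dev_is_dma_coherent() helper and a hypothetical needs_cache_flush flag on the vb2 buffer (both the hook wiring and the flag are illustrations, not existing vb2 code):

```c
#include <linux/dma-buf.h>
#include <linux/dma-map-ops.h>	/* dev_is_dma_coherent() */

/* Hypothetical attach() hook: if the importing device cannot snoop CPU
 * caches, remember that every DQBUF must flush the cache from now on. */
static int vb2_vmalloc_dmabuf_ops_attach(struct dma_buf *dbuf,
					 struct dma_buf_attachment *attach)
{
	struct vb2_vmalloc_buf *buf = dbuf->priv;

	if (!dev_is_dma_coherent(attach->dev))
		buf->needs_cache_flush = true;	/* hypothetical flag */

	return 0;
}
```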
Nicolas
Hi Nicolas,
On Wed, 22 Jun 2022 at 20:39, Nicolas Dufresne nicolas@ndufresne.ca wrote:
Le mardi 16 février 2021 à 10:25 +0100, Daniel Vetter a écrit :
So I think if AMD also guarantees to drop clean cachelines, just do the same thing we do right now for Intel integrated + discrete AMD, but in reverse. It's fragile, but it does work.
Sorry to disrupt, but if you pass V4L2 vmalloc data to Intel display driver, you also get nice dirt on the screen. If you have a UVC webcam that produces a pixel format compatible with your display, you can reproduce the issue quite easily with:
gst-launch-1.0 v4l2src device=/dev/video0 ! kmssink
p.s. some frame rates are less likely to exhibit the issue; make sure you create movement to see it.
Right, this is because the UVC data in a vmalloc() area is not necessarily flushed from the CPU cache, and the importer expects it will be.
The only solution I could think of (not implemented) was to detect in the attach() call what the importers can do (with dev->coherent_dma_mask, if I recall), and otherwise flush the cache immediately and keep flushing the cache from then on, signalling it for DQBUF (in the vb2 workqueue or the dqbuf ioctl, I don't have an idea yet). I bet this idea is inapplicable where you have fences; we don't have that in V4L2.
This idea was hinted at by Robert Becket (now in CC), but perhaps I picked it up wrong, am explaining it wrong, etc. I'm no expert, just noticed there wasn't really a good plan for that, so one needs to make one up. I'm not aware of how an importer could know how the memory was allocated by the exporter, and worse, how an importer could figure out that the exporter is going to produce buffers with a hot CPU cache (the UVC driver does memcpy from USB chunks of variable size to produce a fixed-size image).
This is exactly what Christian was saying above.
Cheers, Daniel
Am 23.06.22 um 01:34 schrieb Daniel Stone:
Hi Nicolas,
Right, this is because the UVC data in a vmalloc() area is not necessarily flushed from the CPU cache, and the importer expects it will be.
Yeah, but that is something perfectly valid for an exporter to do. So the bug is not in UVC.
The only solution I could think of (not implemented) was to detect in the attach() call what the importers can do (with dev->coherent_dma_mask, if I recall), and otherwise flush the cache immediately and keep flushing the cache from then on, signalling it for DQBUF (in the vb2 workqueue or the dqbuf ioctl, I don't have an idea yet). I bet this idea is inapplicable where you have fences; we don't have that in V4L2.
This idea was hinted at by Robert Becket (now in CC), but perhaps I picked it up wrong, am explaining it wrong, etc. I'm no expert, just noticed there wasn't really a good plan for that, so one needs to make one up. I'm not aware of how an importer could know how the memory was allocated by the exporter, and worse, how an importer could figure out that the exporter is going to produce buffers with a hot CPU cache (the UVC driver does memcpy from USB chunks of variable size to produce a fixed-size image).
This is exactly what Christian was saying above.
Well more or less.
The exporter isn't doing anything wrong here. DMA-bufs are supposed to be CPU cached and can also be cache hot.
The importer needs to be able to deal with that: either by flushing the CPU cache manually (awkward), rejecting the DMA-buf for this use case (display scanout), or working around it inside its driver (extra copy, different hw settings, etc.).
Regards, Christian.
Cheers, Daniel
On Thu, 23 Jun 2022 08:59:41 +0200 Christian König christian.koenig@amd.com wrote:
The exporter isn't doing anything wrong here. DMA-bufs are supposed to be CPU cached and can also be cache hot.
Hi,
what is that statement based on?
Were the (mandatory for CPU access) cpu_access_begin/end functions & ioctls not supposed to ensure that CPU cache is up-to-date / CPU cache is fully flushed out?
Thanks, pq
Am 23.06.22 um 09:13 schrieb Pekka Paalanen:
On Thu, 23 Jun 2022 08:59:41 +0200 Christian König christian.koenig@amd.com wrote:
The exporter isn't doing anything wrong here. DMA-bufs are supposed to be CPU cached and can also be cache hot.
Hi,
what is that statement based on?
On the design documentation of DMA-buf and the actual driver implementations.
Coherency and snooping of the CPU cache is mandatory for devices and root complexes in the PCI specification. Incoherent access is just an extension.
We inherited that by basing DMA-buf on the Linux kernel DMA-API which in turn is largely based on the PCI specification.
Were the (mandatory for CPU access) cpu_access_begin/end functions & ioctls not supposed to ensure that CPU cache is up-to-date / CPU cache is fully flushed out?
No, those functions are to inform the exporter that the importer has started and finished accessing the buffer using the CPU.
There is no signaling in the other direction. In other words the exporter doesn't inform the importer about CPU accesses because it is the owner of the buffer.
It's the responsibility of the importer to make sure that it can actually access the data in the buffer. If it can't guarantee that the importer shouldn't import the buffer in the first place.
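From userspace, that ownership model is visible as the DMA_BUF_IOCTL_SYNC handshake: whoever touches the buffer with the CPU brackets the access so the exporter can perform whatever cache maintenance it needs. A minimal sketch using the uapi from <linux/dma-buf.h>:

```c
#include <sys/ioctl.h>
#include <linux/dma-buf.h>

/* Bracket CPU access to a dma-buf fd so the exporter can do cache
 * maintenance. Returns 0 on success, -1 on error (e.g. bad fd). */
static int dmabuf_cpu_access_begin(int fd, int write)
{
	struct dma_buf_sync sync = {
		.flags = DMA_BUF_SYNC_START |
			 (write ? DMA_BUF_SYNC_WRITE : DMA_BUF_SYNC_READ),
	};
	return ioctl(fd, DMA_BUF_IOCTL_SYNC, &sync);
}

static int dmabuf_cpu_access_end(int fd, int write)
{
	struct dma_buf_sync sync = {
		.flags = DMA_BUF_SYNC_END |
			 (write ? DMA_BUF_SYNC_WRITE : DMA_BUF_SYNC_READ),
	};
	return ioctl(fd, DMA_BUF_IOCTL_SYNC, &sync);
}
```

Note that this only informs the exporter about the CPU access; as described above, there is no channel in the other direction.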
Regards, Christian.
Thanks, pq
Am Donnerstag, dem 23.06.2022 um 09:26 +0200 schrieb Christian König:
On the design documentation of DMA-buf and the actual driver implementations.
Coherency and snooping of the CPU cache is mandatory for devices and root complexes in the PCI specification. Incoherent access is just an extension.
We inherited that by basing DMA-buf on the Linux kernel DMA-API which in turn is largely based on the PCI specification.
Were the (mandatory for CPU access) cpu_access_begin/end functions & ioctls not supposed to ensure that CPU cache is up-to-date / CPU cache is fully flushed out?
No, those functions are to inform the exporter that the importer has started and finished accessing the buffer using the CPU.
There is no signaling in the other direction. In other words the exporter doesn't inform the importer about CPU accesses because it is the owner of the buffer.
It's the responsibility of the importer to make sure that it can actually access the data in the buffer. If it can't guarantee that the importer shouldn't import the buffer in the first place.
This is not really correct. DMA-buf inherited the map/unmap part from the DMA API, which on cache-coherent architectures is mostly a no-op or ties into the IOMMU implementation to set up the pagetables for the translation. On non-cache-coherent architectures this is the point where any necessary cache maintenance happens. DRM breaks this model by caching the DMA-buf mapping for performance reasons.
In the DMA API keeping things mapped is also a valid use case, but then you need to do explicit domain transfers via the dma_sync_* family, which DMA-buf has not inherited. Again, those syncs are no-ops on cache-coherent architectures, but do any necessary cache maintenance on non-coherent arches.
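For readers less familiar with that side of the DMA API, the long-lived-mapping pattern with explicit domain transfers looks roughly like this (a generic kernel-side sketch, not tied to any particular driver, and not buildable outside the kernel):

```c
#include <linux/dma-mapping.h>

/* Keep one streaming mapping alive and transfer ownership explicitly
 * around each use. On cache-coherent arches the sync calls are no-ops;
 * on non-coherent arches they do the flush/invalidate. */
static void process_one_frame(struct device *dev, dma_addr_t addr,
			      size_t size)
{
	/* CPU has written the buffer; hand ownership to the device. */
	dma_sync_single_for_device(dev, addr, size, DMA_BIDIRECTIONAL);

	/* ... device DMA reads/writes the buffer here ... */

	/* Take ownership back before the CPU touches the data again. */
	dma_sync_single_for_cpu(dev, addr, size, DMA_BIDIRECTIONAL);
}
```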
Regards, Lucas
Am 23.06.22 um 10:04 schrieb Lucas Stach:
This is not really correct. DMA-buf inherited the map/unmap part from the DMA API, which on cache-coherent architectures is mostly a no-op or ties into the IOMMU implementation to set up the pagetables for the translation. On non-cache-coherent architectures this is the point where any necessary cache maintenance happens. DRM breaks this model by caching the DMA-buf mapping for performance reasons.
That's not only because of performance reasons, but also because of correctness.
At least the Vulkan API and a bunch of OpenGL extensions make it mandatory for the buffer to be cache coherent. The kernel is simply not informed about domain transfers.
For example you can just do a CPU copy to a ring buffer and the expectation is that an already running shader sees that.
In the DMA API keeping things mapped is also a valid use-case, but then you need to do explicit domain transfers via the dma_sync_* family, which DMA-buf has not inherited. Again those sync are no-ops on cache coherent architectures, but do any necessary cache maintenance on non coherent arches.
Correct, yes. Coherency is mandatory for DMA-buf, you can't use dma_sync_* on it when you are the importer.
The exporter could of course make use of that because he is the owner of the buffer.
Regards, Christian.
Regards, Lucas
Am Donnerstag, dem 23.06.2022 um 10:14 +0200 schrieb Christian König:
Am 23.06.22 um 10:04 schrieb Lucas Stach:
This is not really correct. DMA-buf inherited the map/unmap part from the DMA API, which on cache-coherent architectures is mostly a no-op or ties into the IOMMU implementation to set up the pagetables for the translation. On non-cache-coherent architectures this is the point where any necessary cache maintenance happens. DRM breaks this model by caching the DMA-buf mapping for performance reasons.
That's not only because of performance reasons, but also because of correctness.
At least the Vulkan API and a bunch of OpenGL extensions make it mandatory for the buffer to be cache coherent. The kernel is simply not informed about domain transfers.
For example you can just do a CPU copy to a ring buffer and the expectation is that an already running shader sees that.
Yes, that one is not really an issue, as you know that at buffer creation time and can make sure to map those buffers uncached on non-coherent arches. If there are no explicit domain transfer points, non-coherent arches must bite the bullet and bypass the CPU caches, running performance into the ground.
In the DMA API keeping things mapped is also a valid use-case, but then you need to do explicit domain transfers via the dma_sync_* family, which DMA-buf has not inherited. Again those sync are no-ops on cache coherent architectures, but do any necessary cache maintenance on non coherent arches.
Correct, yes. Coherency is mandatory for DMA-buf, you can't use dma_sync_* on it when you are the importer.
The exporter could of course make use of that because he is the owner of the buffer.
In the example given here with UVC video, you don't know that the buffer will be exported and needs to be coherent without synchronization points, due to the mapping cache on the DRM side. So V4L2 naturally allocates the buffers from CPU-cached memory. If the expectation is that those buffers are device-coherent without relying on the map/unmap_attachment calls, then V4L2 needs to always synchronize caches on DQBUF when the buffer is allocated from CPU-cached memory and a single DMA-buf attachment exists. And while writing this I realize that this is probably exactly what V4L2 should do...
Regards, Lucas
Am 23.06.22 um 10:58 schrieb Lucas Stach:
Am Donnerstag, dem 23.06.2022 um 10:14 +0200 schrieb Christian König:
Am 23.06.22 um 10:04 schrieb Lucas Stach:
Am Donnerstag, dem 23.06.2022 um 09:26 +0200 schrieb Christian König:
Am 23.06.22 um 09:13 schrieb Pekka Paalanen:
On Thu, 23 Jun 2022 08:59:41 +0200 Christian König christian.koenig@amd.com wrote:
The exporter isn't doing anything wrong here. DMA-buf are supposed to be CPU cached and can also be cache hot.
Hi,
what is that statement based on?
On the design documentation of DMA-buf and the actual driver implementations.
Coherency and snooping of the CPU cache is mandatory for devices and root complexes in the PCI specification. Incoherent access is just an extension.
We inherited that by basing DMA-buf on the Linux kernel DMA-API which in turn is largely based on the PCI specification.
Were the (mandatory for CPU access) cpu_access_begin/end functions & ioctls not supposed to ensure that CPU cache is up-to-date / CPU cache is fully flushed out?
No, those functions are to inform the exporter that the importer has started and finished accessing the buffer using the CPU.
There is no signaling in the other direction. In other words the exporter doesn't inform the importer about CPU accesses because it is the owner of the buffer.
It's the responsibility of the importer to make sure that it can actually access the data in the buffer. If it can't guarantee that the importer shouldn't import the buffer in the first place.
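The userspace side of this begin/end CPU access bracketing is the DMA_BUF_IOCTL_SYNC ioctl; as stated above, it only informs the exporter and signals nothing to any importer. A minimal sketch (error handling omitted, `dmabuf_fd` assumed to be a mmap'ed DMA-buf):

```c
#include <linux/dma-buf.h>
#include <sys/ioctl.h>

/* Bracket a CPU write to a mapped DMA-buf. */
struct dma_buf_sync sync = {
	.flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_WRITE,
};
ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);

/* ... CPU writes through the mmap'ed pointer ... */

sync.flags = DMA_BUF_SYNC_END | DMA_BUF_SYNC_WRITE;
ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);
```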
This is not really correct. DMA-buf inherited the map/unmap part from the DMA API, which on cache coherent architectures is mostly a no-op or ties into the IOMMU implementation to set up the pagetables for the translation. On non cache coherent architectures this is the point where any necessary cache maintenance happens. DRM breaks this model by caching the DMA-buf mapping for performance reasons.
That's not only because of performance reasons, but also because of correctness.
At least the Vulkan API and a bunch of OpenGL extensions make it mandatory for the buffer to be cache coherent. The kernel is simply not informed about domain transfers.
For example you can just do a CPU copy to a ring buffer and the expectation is that an already running shader sees that.
Yes, that one is not really an issue as you know that at buffer creation time and can make sure to map those buffers uncached on non-coherent arches. If there are no explicit domain transfer points, non-coherent arches must bite the bullet and bypass the CPU caches, running performance into the ground.
Yes, exactly that was what this mail thread was about. But this case is currently not supported by DMA-buf.
In other words, cache coherency is currently mandatory for everybody involved.
In the DMA API keeping things mapped is also a valid use-case, but then you need to do explicit domain transfers via the dma_sync_* family, which DMA-buf has not inherited. Again those syncs are no-ops on cache coherent architectures, but do any necessary cache maintenance on non-coherent arches.
Correct, yes. Coherency is mandatory for DMA-buf, you can't use dma_sync_* on it when you are the importer.
The exporter could of course make use of that because he is the owner of the buffer.
In the example given here with UVC video, you don't know that the buffer will be exported and needs to be coherent without synchronization points, due to the mapping cache at the DRM side. So V4L2 naturally allocates the buffers from CPU cached memory. If the expectation is that those buffers are device coherent without relying on the map/unmap_attachment calls, then V4L2 needs to always synchronize caches on DQBUF when the buffer is allocated from CPU cached memory and a single DMA-buf attachment exists. And while writing this I realize that this is probably exactly what V4L2 should do...
No, the expectation is that the importer can deal with whatever the exporter provides.
If the importer can't access the DMA-buf coherently it's his job to handle that gracefully.
See for example on AMD/Intel hardware most of the engines can perfectly deal with cache coherent memory accesses. Only the display engines can't.
So on import time we can't even say if the access can be coherent and snoop the CPU cache or not because we don't know how the imported DMA-buf will be used later on.
Regards, Christian.
Regards, Lucas
Am Donnerstag, dem 23.06.2022 um 11:09 +0200 schrieb Christian König:
Am 23.06.22 um 10:58 schrieb Lucas Stach:
Am Donnerstag, dem 23.06.2022 um 10:14 +0200 schrieb Christian König:
Am 23.06.22 um 10:04 schrieb Lucas Stach:
Am Donnerstag, dem 23.06.2022 um 09:26 +0200 schrieb Christian König:
Am 23.06.22 um 09:13 schrieb Pekka Paalanen:
On Thu, 23 Jun 2022 08:59:41 +0200 Christian König christian.koenig@amd.com wrote:
The exporter isn't doing anything wrong here. DMA-buf are supposed to be CPU cached and can also be cache hot.
Hi,
what is that statement based on?
On the design documentation of DMA-buf and the actual driver implementations.
Coherency and snooping of the CPU cache is mandatory for devices and root complexes in the PCI specification. Incoherent access is just an extension.
We inherited that by basing DMA-buf on the Linux kernel DMA-API which in turn is largely based on the PCI specification.
Were the (mandatory for CPU access) cpu_access_begin/end functions & ioctls not supposed to ensure that CPU cache is up-to-date / CPU cache is fully flushed out?
No, those functions are to inform the exporter that the importer has started and finished accessing the buffer using the CPU.
There is no signaling in the other direction. In other words the exporter doesn't inform the importer about CPU accesses because it is the owner of the buffer.
It's the responsibility of the importer to make sure that it can actually access the data in the buffer. If it can't guarantee that the importer shouldn't import the buffer in the first place.
This is not really correct. DMA-buf inherited the map/unmap part from the DMA API, which on cache coherent architectures is mostly a no-op or ties into the IOMMU implementation to set up the pagetables for the translation. On non cache coherent architectures this is the point where any necessary cache maintenance happens. DRM breaks this model by caching the DMA-buf mapping for performance reasons.
That's not only because of performance reasons, but also because of correctness.
At least the Vulkan API and a bunch of OpenGL extensions make it mandatory for the buffer to be cache coherent. The kernel is simply not informed about domain transfers.
For example you can just do a CPU copy to a ring buffer and the expectation is that an already running shader sees that.
Yes, that one is not really an issue as you know that at buffer creation time and can make sure to map those buffers uncached on non-coherent arches. If there are no explicit domain transfer points, non-coherent arches must bite the bullet and bypass the CPU caches, running performance into the ground.
Yes, exactly that was what this mail thread was about. But this case is currently not supported by DMA-buf.
In other words, cache coherency is currently mandatory for everybody involved.
In the DMA API keeping things mapped is also a valid use-case, but then you need to do explicit domain transfers via the dma_sync_* family, which DMA-buf has not inherited. Again those syncs are no-ops on cache coherent architectures, but do any necessary cache maintenance on non-coherent arches.
Correct, yes. Coherency is mandatory for DMA-buf, you can't use dma_sync_* on it when you are the importer.
The exporter could of course make use of that because he is the owner of the buffer.
In the example given here with UVC video, you don't know that the buffer will be exported and needs to be coherent without synchronization points, due to the mapping cache at the DRM side. So V4L2 naturally allocates the buffers from CPU cached memory. If the expectation is that those buffers are device coherent without relying on the map/unmap_attachment calls, then V4L2 needs to always synchronize caches on DQBUF when the buffer is allocated from CPU cached memory and a single DMA-buf attachment exists. And while writing this I realize that this is probably exactly what V4L2 should do...
No, the expectation is that the importer can deal with whatever the exporter provides.
If the importer can't access the DMA-buf coherently it's his job to handle that gracefully.
How does the importer know that the memory behind the DMA-buf is in CPU cached memory?
If you now tell me that an importer always needs to assume this and reject the import if it can't do snooping, then any DMA-buf usage on most ARM SoCs is currently invalid usage. On most of the multimedia targeted ARM SoCs being unable to snoop the cache is the norm, not an exception.
See for example on AMD/Intel hardware most of the engines can perfectly deal with cache coherent memory accesses. Only the display engines can't.
So on import time we can't even say if the access can be coherent and snoop the CPU cache or not because we don't know how the imported DMA-buf will be used later on.
So for those mixed use cases, wouldn't it help to have something similar to dma_sync in the DMA-buf API, so your scanout usage can tell the exporter that it's going to do non-snoop access and any dirty cache lines must be cleaned? Signaling this to the exporter would allow skipping the cache maintenance if the buffer is in CPU uncached memory, which again is a default case for the ARM SoC world.
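A purely hypothetical sketch of what such signaling could look like; none of these functions exist in the kernel today, all names are invented for illustration:

```c
/* Hypothetical: the importer announces a non-snooping access so the
 * exporter can clean dirty cache lines, or skip the maintenance
 * entirely when the backing store is CPU uncached. */
int dma_buf_begin_device_access(struct dma_buf_attachment *attach,
				enum dma_data_direction dir,
				bool access_is_coherent)
{
	if (access_is_coherent || !dma_buf_is_cpu_cached(attach->dmabuf))
		return 0; /* nothing to do, no cache maintenance needed */

	/* Exporter-side cache clean/invalidate, analogous to the
	 * dma_sync_* family in the DMA API. */
	return attach->dmabuf->ops->sync_for_device(attach, dir);
}
```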
Regards, Lucas
Am 23.06.22 um 11:33 schrieb Lucas Stach:
[SNIP]
In the DMA API keeping things mapped is also a valid use-case, but then you need to do explicit domain transfers via the dma_sync_* family, which DMA-buf has not inherited. Again those syncs are no-ops on cache coherent architectures, but do any necessary cache maintenance on non-coherent arches.
Correct, yes. Coherency is mandatory for DMA-buf, you can't use dma_sync_* on it when you are the importer.
The exporter could of course make use of that because he is the owner of the buffer.
In the example given here with UVC video, you don't know that the buffer will be exported and needs to be coherent without synchronization points, due to the mapping cache at the DRM side. So V4L2 naturally allocates the buffers from CPU cached memory. If the expectation is that those buffers are device coherent without relying on the map/unmap_attachment calls, then V4L2 needs to always synchronize caches on DQBUF when the buffer is allocated from CPU cached memory and a single DMA-buf attachment exists. And while writing this I realize that this is probably exactly what V4L2 should do...
No, the expectation is that the importer can deal with whatever the exporter provides.
If the importer can't access the DMA-buf coherently it's his job to handle that gracefully.
How does the importer know that the memory behind the DMA-buf is in CPU cached memory?
If you now tell me that an importer always needs to assume this and reject the import if it can't do snooping, then any DMA-buf usage on most ARM SoCs is currently invalid usage.
Yes, exactly that. I've pointed out a couple of times now that a lot of ARM SoCs don't implement that the way we need it.
We already had tons of bug reports because somebody attached a random PCI root complex to an ARM SoC and expected it to work with for example an AMD GPU.
Non-cache coherent applications are currently not really supported by the DMA-buf framework in any way.
On most of the multimedia targeted ARM SoCs being unable to snoop the cache is the norm, not an exception.
See for example on AMD/Intel hardware most of the engines can perfectly deal with cache coherent memory accesses. Only the display engines can't.
So on import time we can't even say if the access can be coherent and snoop the CPU cache or not because we don't know how the imported DMA-buf will be used later on.
So for those mixed use cases, wouldn't it help to have something similar to dma_sync in the DMA-buf API, so your scanout usage can tell the exporter that it's going to do non-snoop access and any dirty cache lines must be cleaned? Signaling this to the exporter would allow skipping the cache maintenance if the buffer is in CPU uncached memory, which again is a default case for the ARM SoC world.
Well for the AMD and Intel use cases we at least have the opportunity to signal cache flushing, but I'm not sure if that counts for everybody.
What we would rather do for those use cases is an indicator on the DMA-buf of whether the underlying backing store is CPU cached or not. The importer can then cleanly reject the use cases where it can't support CPU cache snooping.
This then results in the normal fallback paths which we have anyway for those use cases because DMA-buf sharing is not always possible.
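An importer-side sketch of what checking such an indicator could look like; `dma_buf_is_cpu_cached` and `device_can_snoop` are hypothetical and do not exist in the current kernel, only `dma_buf_attach` is a real API:

```c
/* Hypothetical: reject the import cleanly when the backing store is
 * CPU cached but this device cannot snoop the CPU caches, so the
 * caller takes its normal non-DMA-buf fallback path. */
struct dma_buf_attachment *attach;

if (dma_buf_is_cpu_cached(dmabuf) && !device_can_snoop(dev))
	return ERR_PTR(-EOPNOTSUPP);

attach = dma_buf_attach(dmabuf, dev);
```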
Regards, Christian.
Regards, Lucas
Am Donnerstag, dem 23.06.2022 um 11:46 +0200 schrieb Christian König:
Am 23.06.22 um 11:33 schrieb Lucas Stach:
[SNIP]
In the DMA API keeping things mapped is also a valid use-case, but then you need to do explicit domain transfers via the dma_sync_* family, which DMA-buf has not inherited. Again those syncs are no-ops on cache coherent architectures, but do any necessary cache maintenance on non-coherent arches.
Correct, yes. Coherency is mandatory for DMA-buf, you can't use dma_sync_* on it when you are the importer.
The exporter could of course make use of that because he is the owner of the buffer.
In the example given here with UVC video, you don't know that the buffer will be exported and needs to be coherent without synchronization points, due to the mapping cache at the DRM side. So V4L2 naturally allocates the buffers from CPU cached memory. If the expectation is that those buffers are device coherent without relying on the map/unmap_attachment calls, then V4L2 needs to always synchronize caches on DQBUF when the buffer is allocated from CPU cached memory and a single DMA-buf attachment exists. And while writing this I realize that this is probably exactly what V4L2 should do...
No, the expectation is that the importer can deal with whatever the exporter provides.
If the importer can't access the DMA-buf coherently it's his job to handle that gracefully.
How does the importer know that the memory behind the DMA-buf is in CPU cached memory?
If you now tell me that an importer always needs to assume this and reject the import if it can't do snooping, then any DMA-buf usage on most ARM SoCs is currently invalid usage.
Yes, exactly that. I've pointed out a couple of times now that a lot of ARM SoCs don't implement that the way we need it.
We already had tons of bug reports because somebody attached a random PCI root complex to an ARM SoC and expected it to work with for example an AMD GPU.
Non-cache coherent applications are currently not really supported by the DMA-buf framework in any way.
I'm not talking about bolting a PCIe root complex, with its implicitly inherited "PCI is cache coherent" expectations, onto an ARM SoC, but about the standard VPU/GPU/display engines, which don't snoop on most ARM SoCs.
On most of the multimedia targeted ARM SoCs being unable to snoop the cache is the norm, not an exception.
See for example on AMD/Intel hardware most of the engines can perfectly deal with cache coherent memory accesses. Only the display engines can't.
So on import time we can't even say if the access can be coherent and snoop the CPU cache or not because we don't know how the imported DMA-buf will be used later on.
So for those mixed use cases, wouldn't it help to have something similar to dma_sync in the DMA-buf API, so your scanout usage can tell the exporter that it's going to do non-snoop access and any dirty cache lines must be cleaned? Signaling this to the exporter would allow skipping the cache maintenance if the buffer is in CPU uncached memory, which again is a default case for the ARM SoC world.
Well for the AMD and Intel use cases we at least have the opportunity to signal cache flushing, but I'm not sure if that counts for everybody.
Sure, all the non-coherent arches have some way to do the cache maintenance in some explicit way. Non-coherent hardware with no cache maintenance instructions would be a recipe for disaster. ;)
What we would rather do for those use cases is an indicator on the DMA-buf of whether the underlying backing store is CPU cached or not. The importer can then cleanly reject the use cases where it can't support CPU cache snooping.
This then results in the normal fallback paths which we have anyway for those use cases because DMA-buf sharing is not always possible.
That's a very x86-centric world view you have there. 99% of DMA-buf uses on those cheap ARM SoCs are non-snooping. We cannot do any fallbacks here, as the whole graphics world on those SoCs with their different IP cores mixed together depends on DMA-buf sharing working efficiently even when the SoC is mostly non-coherent.
In fact DMA-buf sharing works fine on most of those SoCs because everyone just assumes that all the accelerators don't snoop, so the memory shared via DMA-buf is mostly CPU uncached. It only falls apart for uses like the UVC cameras, where the shared buffer ends up being CPU cached.
Non-coherent access without explicit domain transfer points is just not going to work. So why can't we solve the issue for DMA-buf in the same way as the DMA API already solved it years ago: by adding the equivalent of the dma_sync calls that do cache maintenance when necessary? On x86 (or any system where things are mostly coherent) you could still no-op them for the common case and only trigger cache cleaning if the importer explicitly says that it is going to do a non-snooping access.
Regards, Lucas
Am 23.06.22 um 12:13 schrieb Lucas Stach:
[SNIP]
On most of the multimedia targeted ARM SoCs being unable to snoop the cache is the norm, not an exception.
See for example on AMD/Intel hardware most of the engines can perfectly deal with cache coherent memory accesses. Only the display engines can't.
So on import time we can't even say if the access can be coherent and snoop the CPU cache or not because we don't know how the imported DMA-buf will be used later on.
So for those mixed use cases, wouldn't it help to have something similar to dma_sync in the DMA-buf API, so your scanout usage can tell the exporter that it's going to do non-snoop access and any dirty cache lines must be cleaned? Signaling this to the exporter would allow skipping the cache maintenance if the buffer is in CPU uncached memory, which again is a default case for the ARM SoC world.
Well for the AMD and Intel use cases we at least have the opportunity to signal cache flushing, but I'm not sure if that counts for everybody.
Sure, all the non-coherent arches have some way to do the cache maintenance in some explicit way. Non-coherent hardware with no cache maintenance instructions would be a recipe for disaster. ;)
What we would rather do for those use cases is an indicator on the DMA-buf of whether the underlying backing store is CPU cached or not. The importer can then cleanly reject the use cases where it can't support CPU cache snooping.
This then results in the normal fallback paths which we have anyway for those use cases because DMA-buf sharing is not always possible.
That's a very x86-centric world view you have there. 99% of DMA-buf uses on those cheap ARM SoCs are non-snooping. We cannot do any fallbacks here, as the whole graphics world on those SoCs with their different IP cores mixed together depends on DMA-buf sharing working efficiently even when the SoC is mostly non-coherent.
In fact DMA-buf sharing works fine on most of those SoCs because everyone just assumes that all the accelerators don't snoop, so the memory shared via DMA-buf is mostly CPU uncached. It only falls apart for uses like the UVC cameras, where the shared buffer ends up being CPU cached.
Well then the existing DMA-buf framework is not what you want to use for this.
Non-coherent access without explicit domain transfer points is just not going to work. So why can't we solve the issue for DMA-buf in the same way as the DMA API already solved it years ago: by adding the equivalent of the dma_sync calls that do cache maintenance when necessary? On x86 (or any system where things are mostly coherent) you could still no-op them for the common case and only trigger cache cleaning if the importer explicitly says that it is going to do a non-snooping access.
Because DMA-buf is a framework for buffer sharing between cache coherent devices which don't signal transitions.
We intentionally didn't implement any of the dma_sync_* functions because that would break the intended use case.
You can of course use DMA-buf in an incoherent environment, but then you can't expect that this works all the time.
This is documented behavior and so far we have bluntly rejected any of the complaints that it doesn't work on most ARM SoCs, and I don't really see a way to do this differently.
Regards, Christian.
Regards, Lucas
Hi Christian,
On Thu, 23 Jun 2022 at 12:11, Christian König christian.koenig@amd.com wrote:
In fact DMA-buf sharing works fine on most of those SoCs because everyone just assumes that all the accelerators don't snoop, so the memory shared via DMA-buf is mostly CPU uncached. It only falls apart for uses like the UVC cameras, where the shared buffer ends up being CPU cached.
Well then the existing DMA-buf framework is not what you want to use for this.
Non-coherent access without explicit domain transfer points is just not going to work. So why can't we solve the issue for DMA-buf in the same way as the DMA API already solved it years ago: by adding the equivalent of the dma_sync calls that do cache maintenance when necessary? On x86 (or any system where things are mostly coherent) you could still no-op them for the common case and only trigger cache cleaning if the importer explicitly says that it is going to do a non-snooping access.
Because DMA-buf is a framework for buffer sharing between cache coherent devices which don't signal transitions.
We intentionally didn't implement any of the dma_sync_* functions because that would break the intended use case.
You can of course use DMA-buf in an incoherent environment, but then you can't expect that this works all the time.
This is documented behavior and so far we have bluntly rejected any of the complaints that it doesn't work on most ARM SoCs, and I don't really see a way to do this differently.
For some strange reason, 'let's do buffer sharing but make sure it doesn't work on Arm' wasn't exactly the intention of the groups who came together under the Linaro umbrella to create dmabuf.
If it's really your belief that dmabuf requires universal snooping, I recommend you send the patch to update the documentation, as well as to remove DRIVER_PRIME from, realistically, most non-PCIE drivers.
Cheers, Daniel
Am 23.06.22 um 13:27 schrieb Daniel Stone:
Hi Christian,
On Thu, 23 Jun 2022 at 12:11, Christian König christian.koenig@amd.com wrote:
In fact DMA-buf sharing works fine on most of those SoCs because everyone just assumes that all the accelerators don't snoop, so the memory shared via DMA-buf is mostly CPU uncached. It only falls apart for uses like the UVC cameras, where the shared buffer ends up being CPU cached.
Well then the existing DMA-buf framework is not what you want to use for this.
Non-coherent access without explicit domain transfer points is just not going to work. So why can't we solve the issue for DMA-buf in the same way as the DMA API already solved it years ago: by adding the equivalent of the dma_sync calls that do cache maintenance when necessary? On x86 (or any system where things are mostly coherent) you could still no-op them for the common case and only trigger cache cleaning if the importer explicitly says that it is going to do a non-snooping access.
Because DMA-buf is a framework for buffer sharing between cache coherent devices which don't signal transitions.
We intentionally didn't implement any of the dma_sync_* functions because that would break the intended use case.
You can of course use DMA-buf in an incoherent environment, but then you can't expect that this works all the time.
This is documented behavior and so far we have bluntly rejected any of the complaints that it doesn't work on most ARM SoCs, and I don't really see a way to do this differently.
For some strange reason, 'let's do buffer sharing but make sure it doesn't work on Arm' wasn't exactly the intention of the groups who came together under the Linaro umbrella to create dmabuf.
If it's really your belief that dmabuf requires universal snooping, I recommend you send the patch to update the documentation, as well as to remove DRIVER_PRIME from, realistically, most non-PCIE drivers.
Well, to be honest I think that would indeed be necessary.
What we have created are essentially two different worlds, one for PCI devices and one for the rest.
This was indeed not the intention, but it's a fact that basically all DMA-buf based PCI drivers assume coherent access.
Regards, Christian.
Cheers, Daniel
Am Donnerstag, dem 23.06.2022 um 13:10 +0200 schrieb Christian König:
Am 23.06.22 um 12:13 schrieb Lucas Stach:
[SNIP]
On most of the multimedia targeted ARM SoCs being unable to snoop the cache is the norm, not an exception.
See for example on AMD/Intel hardware most of the engines can perfectly deal with cache coherent memory accesses. Only the display engines can't.
So on import time we can't even say if the access can be coherent and snoop the CPU cache or not because we don't know how the imported DMA-buf will be used later on.
So for those mixed use cases, wouldn't it help to have something similar to dma_sync in the DMA-buf API, so your scanout usage can tell the exporter that it's going to do non-snoop access and any dirty cache lines must be cleaned? Signaling this to the exporter would allow skipping the cache maintenance if the buffer is in CPU uncached memory, which again is a default case for the ARM SoC world.
Well for the AMD and Intel use cases we at least have the opportunity to signal cache flushing, but I'm not sure if that counts for everybody.
Sure, all the non-coherent arches have some way to do the cache maintenance in some explicit way. Non-coherent hardware with no cache maintenance instructions would be a recipe for disaster. ;)
What we would rather do for those use cases is an indicator on the DMA-buf of whether the underlying backing store is CPU cached or not. The importer can then cleanly reject the use cases where it can't support CPU cache snooping.
This then results in the normal fallback paths which we have anyway for those use cases because DMA-buf sharing is not always possible.
That's a very x86-centric world view you have there. 99% of DMA-buf uses on those cheap ARM SoCs are non-snooping. We cannot do any fallbacks here, as the whole graphics world on those SoCs with their different IP cores mixed together depends on DMA-buf sharing working efficiently even when the SoC is mostly non-coherent.
In fact DMA-buf sharing works fine on most of those SoCs because everyone just assumes that all the accelerators don't snoop, so the memory shared via DMA-buf is mostly CPU uncached. It only falls apart for uses like the UVC cameras, where the shared buffer ends up being CPU cached.
Well then the existing DMA-buf framework is not what you want to use for this.
Sorry, but this is just ignoring reality. You try to flag 8+ years of DMA-buf usage on non-coherent arches as "you shouldn't do this". At this point there are probably a lot more users (drivers) of DMA-buf in the kernel for devices which are used on non-coherent arches than there are on coherent arches.
Non-coherent access without explicit domain transfer points is just not going to work. So why can't we solve the issue for DMA-buf in the same way as the DMA API already solved it years ago: by adding the equivalent of the dma_sync calls that do cache maintenance when necessary? On x86 (or any system where things are mostly coherent) you could still no-op them for the common case and only trigger cache cleaning if the importer explicitly says that it is going to do a non-snooping access.
Because DMA-buf is a framework for buffer sharing between cache coherent devices which don't signal transitions.
We intentionally didn't implement any of the dma_sync_* functions because that would break the intended use case.
Non-coherent access, including your non-snoop scanout, and no domain transition signal just don't go together when you want to solve things in a generic way.
Remember that in a fully (not only IO) coherent system the CPU isn't the only agent that may cache the content you are trying to access here. The dirty cacheline could reasonably still be sitting in a GPU or VPU cache, so you need some way to clean those cachelines, which isn't a magic "importer knows how to call CPU cache clean instructions".
You can of course use DMA-buf in an incoherent environment, but then you can't expect that this works all the time.
This is documented behavior and so far we have bluntly rejected any of the complaints that it doesn't work on most ARM SoCs, and I don't really see a way to do this differently.
Can you point me to that part of the documentation? A quick grep for "coherent" didn't immediately turn something up within the DMA-buf dirs.
Regards, Lucas
Am 23.06.22 um 13:29 schrieb Lucas Stach:
[SNIP]
Well then the existing DMA-buf framework is not what you want to use for this.
Sorry, but this is just ignoring reality. You try to flag 8+ years of DMA-buf usage on non-coherent arches as "you shouldn't do this". At this point there are probably a lot more users (drivers) of DMA-buf in the kernel for devices which are used on non-coherent arches than there are on coherent arches.
Well, it's my reality that people come up with bug reports about that and we have been pushing back on this with the explanation "Hey this is not supported" as long as I can think about it.
I mean I even had somebody from ARM who told me that this is not going to work with our GPUs on a specific SoC. That there are ARM internal use cases which just seem to work because all the devices are non-coherent is completely new to me.
I'm as much surprised as you are about this lack of agreement about such fundamental stuff.
Non-coherent access without explicit domain transfer points is just not going to work. So why can't we solve the issue for DMA-buf in the same way as the DMA API already solved it years ago: by adding the equivalent of the dma_sync calls that do cache maintenance when necessary? On x86 (or any system where things are mostly coherent) you could still no-op them for the common case and only trigger cache cleaning if the importer explicitly says that it is going to do a non-snooping access.
Because DMA-buf is a framework for buffer sharing between cache coherent devices which don't signal transitions.
We intentionally didn't implement any of the dma_sync_* functions because that would break the intended use case.
Non-coherent access, including your non-snoop scanout, and no domain transition signal just don't go together when you want to solve things in a generic way.
Yeah, that's the stuff I totally agree on.
See we absolutely do have the requirement of implementing coherent access without domain transitions for Vulkan and OpenGL+extensions.
When we now have to introduce domain transitions to get non coherent access working we are essentially splitting all the drivers into coherent and non-coherent ones.
That doesn't sound like it would improve interop.
Remember that in a fully (not only IO) coherent system the CPU isn't the only agent that may cache the content you are trying to access here. The dirty cacheline could reasonably still be sitting in a GPU or VPU cache, so you need some way to clean those cachelines, which isn't a magic "importer knows how to call CPU cache clean instructions".
IIRC we do already have/had a SYNC_IOCTL for cases like this, but (I need to double check as well, that's way too long ago) this was kicked out because of the requirements above.
You can of course use DMA-buf in an incoherent environment, but then you can't expect that this works all the time.
This is documented behavior and so far we have bluntly rejected any of the complaints that it doesn't work on most ARM SoCs, and I don't really see a way to do this differently.
Can you point me to that part of the documentation? A quick grep for "coherent" didn't immediately turn something up within the DMA-buf dirs.
Search for "cache coherency management". It's quite a while ago, but I do remember helping to review that stuff.
Regards, Christian.
Regards, Lucas
Am Donnerstag, dem 23.06.2022 um 13:54 +0200 schrieb Christian König:
Am 23.06.22 um 13:29 schrieb Lucas Stach:
[SNIP]
Well then the existing DMA-buf framework is not what you want to use for this.
Sorry, but this is just ignoring reality. You try to flag 8+ years of DMA-buf usage on non-coherent arches as "you shouldn't do this". At this point there are probably a lot more users (drivers) of DMA-buf in the kernel for devices, which are used on non-coherent arches, than there are on coherent arches.
Well, it's my reality that people come up with bug reports about that, and we have been pushing back on this with the explanation "Hey, this is not supported" for as long as I can remember.
I mean I even had somebody from ARM who told me that this is not going to work with our GPUs on a specific SoC. That there are ARM internal use cases which just seem to work because all the devices are non-coherent is completely new to me.
Yes, trying to hook up a peripheral that assumes cache snooping in some design details to a non coherent SoC may end up exploding in various ways. On the other hand you can work around most of those assumptions by marking the memory as uncached to the CPU, which may tank performance, but will work from a correctness PoV.
I'm as much surprised as you are about this lack of agreement about such fundamental stuff.
Non-coherent without explicit domain transfer points is just not going to work. So why can't we solve the issue for DMA-buf in the same way as the DMA API already solved it years ago: by adding the equivalent of the dma_sync calls that do cache maintenance when necessary? On x86 (or any system where things are mostly coherent) you could still no-op them for the common case and only trigger cache cleaning if the importer explicitly says that it is going to do a non-snooping access.
Because DMA-buf is a framework for buffer sharing between cache coherent devices which don't signal transitions.
We intentionally didn't implement any of the dma_sync_* functions because that would break the intended use case.
Non coherent access, including your non-snoop scanout, and no domain transition signal just doesn't go together when you want to solve things in a generic way.
Yeah, that's the stuff I totally agree on.
See we absolutely do have the requirement of implementing coherent access without domain transitions for Vulkan and OpenGL+extensions.
Coherent can mean 2 different things:
- CPU cached with snooping from the IO device
- CPU uncached
The Vulkan and GL "coherent" uses are really coherent without explicit domain transitions, so on non coherent arches that require the transitions the only way to implement this is by making the memory CPU uncached. Which from a performance PoV will probably not be what app developers expect, but will still expose the correct behavior.
When we now have to introduce domain transitions to get non coherent access working we are essentially splitting all the drivers into coherent and non-coherent ones.
That doesn't sound like it would improve interop.
Remember that in a fully (not only IO) coherent system the CPU isn't the only agent that may cache the content you are trying to access here. The dirty cacheline could reasonably still be sitting in a GPU or VPU cache, so you need some way to clean those cachelines, which isn't a magic "importer knows how to call CPU cache clean instructions".
IIRC we do already have/had a SYNC_IOCTL for cases like this, but (I need to double check as well, that's way too long ago) this was kicked out because of the requirements above.
The DMA_BUF_IOCTL_SYNC is available in upstream, with the explicit documentation that "userspace can not rely on coherent access".
You can of course use DMA-buf in an incoherent environment, but then you can't expect that this works all the time.
This is documented behavior, and so far we have bluntly rejected any of the complaints that it doesn't work on most ARM SoCs, and I don't really see a way to do this differently.
Can you point me to that part of the documentation? A quick grep for "coherent" didn't immediately turn something up within the DMA-buf dirs.
Search for "cache coherency management". It's quite a while ago, but I do remember helping to review that stuff.
That only turns up the lines in the DMA_BUF_IOCTL_SYNC doc, which say the exact opposite of "DMA-buf is always coherent".
I also don't see why you think that both world views are so totally different. We could just require explicit domain transitions for non-snoop access, which would probably solve your scanout issue and would not be a problem for most ARM systems, where we could no-op this if the buffer is already in uncached memory and at the same time keep the "x86 assumes cached + snooped access by default" semantics.
Regards, Lucas
Am 23.06.22 um 14:14 schrieb Lucas Stach:
Am Donnerstag, dem 23.06.2022 um 13:54 +0200 schrieb Christian König:
Am 23.06.22 um 13:29 schrieb Lucas Stach: [SNIP] I mean I even had somebody from ARM who told me that this is not going to work with our GPUs on a specific SoC. That there are ARM internal use cases which just seem to work because all the devices are non-coherent is completely new to me.
Yes, trying to hook up a peripheral that assumes cache snooping in some design details to a non coherent SoC may end up exploding in various ways. On the other hand you can work around most of those assumptions by marking the memory as uncached to the CPU, which may tank performance, but will work from a correctness PoV.
Yeah, and exactly that's what I meant with "DMA-buf is not the framework for this".
See we do support using uncached/not snooped memory in DMA-buf, but only for the exporter side.
For example the AMD and Intel GPUs have a per buffer flag for this.
The importer on the other hand needs to be able to handle whatever the exporter provides.
[SNIP]
Non coherent access, including your non-snoop scanout, and no domain transition signal just doesn't go together when you want to solve things in a generic way.
Yeah, that's the stuff I totally agree on.
See we absolutely do have the requirement of implementing coherent access without domain transitions for Vulkan and OpenGL+extensions.
Coherent can mean 2 different things:
- CPU cached with snooping from the IO device
- CPU uncached
The Vulkan and GL "coherent" uses are really coherent without explicit domain transitions, so on non coherent arches that require the transitions the only way to implement this is by making the memory CPU uncached. Which from a performance PoV will probably not be what app developers expect, but will still expose the correct behavior.
Quite a bummer for performance, but yes, that should work.
Remember that in a fully (not only IO) coherent system the CPU isn't the only agent that may cache the content you are trying to access here. The dirty cacheline could reasonably still be sitting in a GPU or VPU cache, so you need some way to clean those cachelines, which isn't a magic "importer knows how to call CPU cache clean instructions".
IIRC we do already have/had a SYNC_IOCTL for cases like this, but (I need to double check as well, that's way too long ago) this was kicked out because of the requirements above.
The DMA_BUF_IOCTL_SYNC is available in upstream, with the explicit documentation that "userspace can not rely on coherent access".
Yeah, double checked that as well. This is for the coherency case on the exporter side.
You can of course use DMA-buf in an incoherent environment, but then you can't expect that this works all the time.
This is documented behavior, and so far we have bluntly rejected any of the complaints that it doesn't work on most ARM SoCs, and I don't really see a way to do this differently.
Can you point me to that part of the documentation? A quick grep for "coherent" didn't immediately turn something up within the DMA-buf dirs.
Search for "cache coherency management". It's quite a while ago, but I do remember helping to review that stuff.
That only turns up the lines in the DMA_BUF_IOCTL_SYNC doc, which say the exact opposite of "DMA-buf is always coherent".
Sounds like I'm not making clear what I want to say here: For the exporter using cache coherent memory is optional, for the importer it isn't.
For the exporter it is perfectly valid to use kmalloc, get_free_page etc. on its buffers as long as it uses the DMA API to give the importer access to them.
The importer on the other hand needs to be able to deal with that. When this is not the case then the importer somehow needs to work around that.
Either by flushing the CPU caches or by rejecting using the imported buffer for this specific use case (like AMD and Intel drivers should be doing).
If the Intel or ARM display drivers need non-cached memory and don't reject buffers where they don't know this, then that's certainly a bug in those drivers.
Otherwise we would need to change all DMA-buf exporters to use a special function for allocating non-coherent memory, and that is certainly not going to fly.
I also don't see why you think that both world views are so totally different. We could just require explicit domain transitions for non- snoop access, which would probably solve your scanout issue and would not be a problem for most ARM systems, where we could no-op this if the buffer is already in uncached memory and at the same time keep the "x86 assumes cached + snooped access by default" semantics.
Well the key point is we intentionally rejected that design previously because it created all kinds of trouble as well.
For this limited use case of doing a domain transition right before scanout it might make sense, but that's just one use case.
Regards, Christian.
Regards, Lucas
Am Donnerstag, dem 23.06.2022 um 14:52 +0200 schrieb Christian König:
Am 23.06.22 um 14:14 schrieb Lucas Stach:
Am Donnerstag, dem 23.06.2022 um 13:54 +0200 schrieb Christian König:
Am 23.06.22 um 13:29 schrieb Lucas Stach: [SNIP] I mean I even had somebody from ARM who told me that this is not going to work with our GPUs on a specific SoC. That there are ARM internal use cases which just seem to work because all the devices are non-coherent is completely new to me.
Yes, trying to hook up a peripheral that assumes cache snooping in some design details to a non coherent SoC may end up exploding in various ways. On the other hand you can work around most of those assumptions by marking the memory as uncached to the CPU, which may tank performance, but will work from a correctness PoV.
Yeah, and exactly that's what I meant with "DMA-buf is not the framework for this".
See we do support using uncached/not snooped memory in DMA-buf, but only for the exporter side.
For example the AMD and Intel GPUs have a per buffer flag for this.
The importer on the other hand needs to be able to handle whatever the exporter provides.
I fail to construct a case where you want the Vulkan/GL "no domain transition" coherent semantic without the allocator knowing about this. If you need this and the system is non-snooping, surely the allocator will choose uncached memory.
I agree that you absolutely need to fail the usage when someone imports a CPU cached buffer and then tries to use it as GL coherent on a non-snooping system. That simply will not work.
[SNIP]
Non coherent access, including your non-snoop scanout, and no domain transition signal just doesn't go together when you want to solve things in a generic way.
Yeah, that's the stuff I totally agree on.
See we absolutely do have the requirement of implementing coherent access without domain transitions for Vulkan and OpenGL+extensions.
Coherent can mean 2 different things:
- CPU cached with snooping from the IO device
- CPU uncached
The Vulkan and GL "coherent" uses are really coherent without explicit domain transitions, so on non coherent arches that require the transitions the only way to implement this is by making the memory CPU uncached. Which from a performance PoV will probably not be what app developers expect, but will still expose the correct behavior.
Quite a bummer for performance, but yes, that should work.
Remember that in a fully (not only IO) coherent system the CPU isn't the only agent that may cache the content you are trying to access here. The dirty cacheline could reasonably still be sitting in a GPU or VPU cache, so you need some way to clean those cachelines, which isn't a magic "importer knows how to call CPU cache clean instructions".
IIRC we do already have/had a SYNC_IOCTL for cases like this, but (I need to double check as well, that's way too long ago) this was kicked out because of the requirements above.
The DMA_BUF_IOCTL_SYNC is available in upstream, with the explicit documentation that "userspace can not rely on coherent access".
Yeah, double checked that as well. This is for the coherency case on the exporter side.
You can of course use DMA-buf in an incoherent environment, but then you can't expect that this works all the time.
This is documented behavior, and so far we have bluntly rejected any of the complaints that it doesn't work on most ARM SoCs, and I don't really see a way to do this differently.
Can you point me to that part of the documentation? A quick grep for "coherent" didn't immediately turn something up within the DMA-buf dirs.
Search for "cache coherency management". It's quite a while ago, but I do remember helping to review that stuff.
That only turns up the lines in the DMA_BUF_IOCTL_SYNC doc, which say the exact opposite of "DMA-buf is always coherent".
Sounds like I'm not making clear what I want to say here: For the exporter using cache coherent memory is optional, for the importer it isn't.
For the exporter it is perfectly valid to use kmalloc, get_free_page etc... on his buffers as long as it uses the DMA API to give the importer access to it.
And here is where our line of thought diverges: the DMA API allows snooping and non-snooping devices to work together just fine, as it has explicit domain transitions, which are no-ops if both devices are snooping, but will do the necessary cache maintenance when one of them is non-snooping but the memory is CPU cached.
I don't see why DMA-buf should be any different here. Yes, you cannot support the "no domain transition" sharing when the memory is CPU cached and one of the devices is non-snooping, but you can support 99% of real use-cases like the non-snooped scanout or the UVC video import.
The importer on the other hand needs to be able to deal with that. When this is not the case then the importer somehow needs to work around that.
Why? The importer maps the dma-buf via dma_buf_map_attachment, which in most cases triggers a map via the DMA API on the exporter side. This map via the DMA API will already do the right thing in terms of cache management, it's just that we explicitly disable it via DMA_ATTR_SKIP_CPU_SYNC in DRM because we know that the mapping will be cached, which violates the DMA API explicit domain transition anyway.
Either by flushing the CPU caches or by rejecting using the imported buffer for this specific use case (like AMD and Intel drivers should be doing).
If the Intel or ARM display drivers need non-cached memory and don't reject buffers where they don't know this, then that's certainly a bug in those drivers.
It's not just display drivers, video codec accelerators and most GPUs in this space are also non-snooping. In the ARM SoC world everyone just assumes you are non-snooping, which is why things work for most cases and only a handful like the UVC video import is broken.
Otherwise we would need to change all DMA-buf exporters to use a special function for allocating non-coherent memory, and that is certainly not going to fly.
I also don't see why you think that both world views are so totally different. We could just require explicit domain transitions for non-snoop access, which would probably solve your scanout issue and would not be a problem for most ARM systems, where we could no-op this if the buffer is already in uncached memory and at the same time keep the "x86 assumes cached + snooped access by default" semantics.
Well the key point is we intentionally rejected that design previously because it created all kinds of trouble as well.
I would really like to know what issues popped up there. Moving the dma-buf attachment to work more like a buffer used with the DMA API seems like a good thing to me.
For this limited use case of doing a domain transition right before scanout it might make sense, but that's just one use case.
The only case I see that we still couldn't support with a change in that direction is the GL coherent access to an imported buffer that has been allocated from CPU cached memory on a system with non-snooping agents. Which to me sounds like a pretty niche use-case, but I would be happy to be proven wrong.
Regards, Lucas
Am 23.06.22 um 17:26 schrieb Lucas Stach:
Am Donnerstag, dem 23.06.2022 um 14:52 +0200 schrieb Christian König:
Am 23.06.22 um 14:14 schrieb Lucas Stach:
Am Donnerstag, dem 23.06.2022 um 13:54 +0200 schrieb Christian König:
Am 23.06.22 um 13:29 schrieb Lucas Stach: [SNIP] I mean I even had somebody from ARM who told me that this is not going to work with our GPUs on a specific SoC. That there are ARM internal use cases which just seem to work because all the devices are non-coherent is completely new to me.
Yes, trying to hook up a peripheral that assumes cache snooping in some design details to a non coherent SoC may end up exploding in various ways. On the other hand you can work around most of those assumptions by marking the memory as uncached to the CPU, which may tank performance, but will work from a correctness PoV.
Yeah, and exactly that's what I meant with "DMA-buf is not the framework for this".
See we do support using uncached/not snooped memory in DMA-buf, but only for the exporter side.
For example the AMD and Intel GPUs have a per buffer flag for this.
The importer on the other hand needs to be able to handle whatever the exporter provides.
I fail to construct a case where you want the Vulkan/GL "no domain transition" coherent semantic without the allocator knowing about this. If you need this and the system is non-snooping, surely the allocator will choose uncached memory.
No it won't. The allocator in the exporter is independent of the importer.
That is an important and intentional design decision, cause otherwise you wouldn't have exporters/importers in the first place and rather a centralized allocation pool like what dma-heap implements.
See the purpose of DMA-buf is to expose the buffers in the way the exporter wants to expose them. So when the exporting driver wants to allocate normal cached system memory then that is perfectly fine and completely fits into this design.
Otherwise we would need to adjust all exporters to the importers, which is potentially not even possible.
I agree that you absolutely need to fail the usage when someone imports a CPU cached buffer and then tries to use it as GL coherent on a non- snooping system. That simply will not work.
Exactly that, yes. That's what the attach callback is good for.
See we already have tons of cases where buffers can't be shared because they weren't initially allocated in a way the importer can deal with. But that's perfectly ok and intentional.
For example just take a configuration where a dedicated GPU clones the display with an integrated GPU. The dedicated GPU needs the image in local memory for scanout which is usually not accessible to the integrated GPU.
So either attaching the DMA-buf or creating the KMS framebuffer config will fail and we are running into the fallback path which involves an extra copy. And that is perfectly fine and intentional since this configuration is not supported by the hardware.
[SNIP]
You can of course use DMA-buf in an incoherent environment, but then you can't expect that this works all the time.
This is documented behavior, and so far we have bluntly rejected any of the complaints that it doesn't work on most ARM SoCs, and I don't really see a way to do this differently.
Can you point me to that part of the documentation? A quick grep for "coherent" didn't immediately turn something up within the DMA-buf dirs.
Search for "cache coherency management". It's quite a while ago, but I do remember helping to review that stuff.
That only turns up the lines in the DMA_BUF_IOCTL_SYNC doc, which say the exact opposite of "DMA-buf is always coherent".
Sounds like I'm not making clear what I want to say here: For the exporter using cache coherent memory is optional, for the importer it isn't.
For the exporter it is perfectly valid to use kmalloc, get_free_page etc. on its buffers as long as it uses the DMA API to give the importer access to them.
And here is where our line of thought diverges: the DMA API allows snooping and non-snooping devices to work together just fine, as it has explicit domain transitions, which are no-ops if both devices are snooping, but will do the necessary cache maintenance when one of them is non-snooping but the memory is CPU cached.
I don't see why DMA-buf should be any different here. Yes, you cannot support the "no domain transition" sharing when the memory is CPU cached and one of the devices is non-snooping, but you can support 99% of real use-cases like the non-snooped scanout or the UVC video import.
Well I didn't say we couldn't do it that way. What I'm saying is that it was intentionally decided against.
We could re-iterate that decision, but this would mean that all existing exporters would now need to provide additional functionality.
The importer on the other hand needs to be able to deal with that. When this is not the case then the importer somehow needs to work around that.
Why? The importer maps the dma-buf via dma_buf_map_attachment, which in most cases triggers a map via the DMA API on the exporter side. This map via the DMA API will already do the right thing in terms of cache management, it's just that we explicitly disable it via DMA_ATTR_SKIP_CPU_SYNC in DRM because we know that the mapping will be cached, which violates the DMA API explicit domain transition anyway.
Why doesn't the importer simply call dma_sync_sg_for_device() as necessary? See the importer does already know when it needs to access the buffer and as far as I can see has all the necessary variables to do the sync.
The exporter on the other hand doesn't know that. So we would need to transport this information.
Another fundamental problem is that the DMA API isn't designed for device to device transitions. In other words you have CPU->device and device->CPU transitions, but not device->device. As far as I can see the DMA API should already have the necessary information on whether things like cache flushes are necessary or not.
Either by flushing the CPU caches or by rejecting using the imported buffer for this specific use case (like AMD and Intel drivers should be doing).
If the Intel or ARM display drivers need non-cached memory and don't reject buffer where they don't know this then that's certainly a bug in those drivers.
It's not just display drivers, video codec accelerators and most GPUs in this space are also non-snooping. In the ARM SoC world everyone just assumes you are non-snooping, which is why things work for most cases and only a handful like the UVC video import is broken.
That is really interesting to know, but I still think that DMA-buf was absolutely not designed for this use case.
From my point of view, the primary reason for this was laptops with both dedicated and integrated GPUs, webcams etc.
That you have a huge number of ARM specific devices which can interop with themselves, but not with devices outside of their domain is not something foreseen here.
Regards, Christian.
Otherwise we would need to change all DMA-buf exporters to use a special function for allocating non-coherent memory, and that is certainly not going to fly.
I also don't see why you think that both world views are so totally different. We could just require explicit domain transitions for non-snoop access, which would probably solve your scanout issue and would not be a problem for most ARM systems, where we could no-op this if the buffer is already in uncached memory and at the same time keep the "x86 assumes cached + snooped access by default" semantics.
Well the key point is we intentionally rejected that design previously because it created all kinds of trouble as well.
I would really like to know what issues popped up there. Moving the dma-buf attachment to work more like a buffer used with the DMA API seems like a good thing to me.
For this limited use case of doing a domain transition right before scanout it might make sense, but that's just one use case.
The only case I see that we still couldn't support with a change in that direction is the GL coherent access to an imported buffer that has been allocated from CPU cached memory on a system with non-snooping agents. Which to me sounds like a pretty niche use-case, but I would be happy to be proven wrong.
Regards, Lucas
Am Freitag, dem 24.06.2022 um 08:54 +0200 schrieb Christian König:
Am 23.06.22 um 17:26 schrieb Lucas Stach:
Am Donnerstag, dem 23.06.2022 um 14:52 +0200 schrieb Christian König:
Am 23.06.22 um 14:14 schrieb Lucas Stach:
Am Donnerstag, dem 23.06.2022 um 13:54 +0200 schrieb Christian König:
Am 23.06.22 um 13:29 schrieb Lucas Stach: [SNIP] I mean I even had somebody from ARM who told me that this is not going to work with our GPUs on a specific SoC. That there are ARM internal use cases which just seem to work because all the devices are non-coherent is completely new to me.
Yes, trying to hook up a peripheral that assumes cache snooping in some design details to a non coherent SoC may end up exploding in various ways. On the other hand you can work around most of those assumptions by marking the memory as uncached to the CPU, which may tank performance, but will work from a correctness PoV.
Yeah, and exactly that's what I meant with "DMA-buf is not the framework for this".
See we do support using uncached/not snooped memory in DMA-buf, but only for the exporter side.
For example the AMD and Intel GPUs have a per buffer flag for this.
The importer on the other hand needs to be able to handle whatever the exporter provides.
I fail to construct a case where you want the Vulkan/GL "no domain transition" coherent semantic without the allocator knowing about this. If you need this and the system is non-snooping, surely the allocator will choose uncached memory.
No it won't. The allocator in the exporter is independent of the importer.
That is an important and intentional design decision, cause otherwise you wouldn't have exporters/importers in the first place and rather a centralized allocation pool like what dma-heap implements.
See the purpose of DMA-buf is to expose the buffers in the way the exporter wants to expose them. So when the exporting driver wants to allocate normal cached system memory then that is perfectly fine and completely fits into this design.
I'm specifically talking about the case where a snooping exporter would allocate the GL coherent buffer and a non-snooping importer would need to access that buffer with the same "no domain transition needed" semantic. That is the thing which we can not make work at all and need to fail the attach. If both the exporter and importer are non-snooping you would probably get uncached memory, as long as the exporter knows how the buffer will be used. Is there a real use-case where the exporter doesn't know that the buffer will be used as GL/Vulkan coherent and we can't do fallback on the importer side?
Otherwise we would need to adjust all exporters to the importers, which is potentially not even possible.
I agree that you absolutely need to fail the usage when someone imports a CPU cached buffer and then tries to use it as GL coherent on a non-snooping system. That simply will not work.
Exactly that, yes. That's what the attach callback is good for.
See we already have tons of cases where buffers can't be shared because they weren't initially allocated in a way the importer can deal with. But that's perfectly ok and intentional.
For example just take a configuration where a dedicated GPU clones the display with an integrated GPU. The dedicated GPU needs the image in local memory for scanout which is usually not accessible to the integrated GPU.
So either attaching the DMA-buf or creating the KMS framebuffer config will fail and we are running into the fallback path which involves an extra copy. And that is perfectly fine and intentional since this configuration is not supported by the hardware.
[SNIP]
And here is where our line of thought diverges: the DMA API allows snooping and non-snooping devices to work together just fine, as it has explicit domain transitions, which are no-ops if both devices are snooping, but will do the necessary cache maintenance when one of them is non-snooping but the memory is CPU cached.
I don't see why DMA-buf should be any different here. Yes, you cannot support the "no domain transition" sharing when the memory is CPU cached and one of the devices is non-snooping, but you can support 99% of real use-cases like the non-snooped scanout or the UVC video import.
Well I didn't say we couldn't do it that way. What I'm saying is that it was intentionally decided against.
We could re-iterate that decision, but this would mean that all existing exporters would now need to provide additional functionality.
The way I see it we would only need this for exporters that potentially export CPU cached memory, but need to interop with non-snooping importers. I guess that can be done on a case-by-case basis and wouldn't be a big flag day operation.
The importer on the other hand needs to be able to deal with that. When this is not the case then the importer somehow needs to work around that.
Why? The importer maps the dma-buf via dma_buf_map_attachment, which in most cases triggers a map via the DMA API on the exporter side. This map via the DMA API will already do the right thing in terms of cache management, it's just that we explicitly disable it via DMA_ATTR_SKIP_CPU_SYNC in DRM because we know that the mapping will be cached, which violates the DMA API explicit domain transition anyway.
Why doesn't the importer simply call dma_sync_sg_for_device() as necessary? See the importer does already know when it needs to access the buffer and as far as I can see has all the necessary variables to do the sync.
First, it wouldn't be symmetric with the dma_buf_map_attachment, where the actual dma_map_sg also happens on the exporter side.
Second, that is again a very x86 with PCI centric view. The importer flushing CPU caches by calling dma_sync_sg_for_device will only suffice in a world where devices are IO coherent, i.e. they snoop the CPU cache but don't participate fully in the system coherency due to never keeping dirty cache lines for buffers in system memory.
On fully coherent systems like ARM with AMBA CHI or x86 with CXL.cache all devices with access to the buffer can keep dirty cache lines in their device private caches, so any access from a non-snooping agent will require a cache clean on all those devices, which would basically require the dma_buf_sync to be a broadcast operation to the exporter and all attached fully coherent importers.
The exporter on the other hand doesn't know that. So we would need to transport this information.
Another fundamental problem is that the DMA API isn't designed for device to device transitions. In other words you have CPU->device and device->CPU transitions, but not device->device. As far as I can see the DMA API should already have the necessary information on whether things like cache flushes are necessary or not.
Don't you contradict the second part here with the first? The DMA API doesn't have the necessary information about the cache cleaning needed on the exporter's side or on other attached importers when you only call dma_sync on the importer. That is exactly why I'm arguing for putting it in the dma_buf ops, so we can do the necessary operations on the other attached clients to make a device->device transition work reliably.
Either by flushing the CPU caches or by rejecting using the imported buffer for this specific use case (like AMD and Intel drivers should be doing).
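As a sketch of what such a dma_buf_ops extension might look like (purely hypothetical, not an existing kernel interface; the callback name and signature are invented here for illustration):

```c
/* Hypothetical addition to struct dma_buf_ops: a transition hook the
 * importer calls before its device accesses the buffer, so that the
 * exporter can broadcast the required cache maintenance to the CPU
 * and to all other attached, fully coherent importers. */
struct dma_buf_ops {
	/* ... existing callbacks (attach, map_dma_buf, vmap, ...) ... */

	/*
	 * Prepare the buffer for access by @attach's device. For a
	 * non-snooping importer this would clean dirty cache lines on
	 * the CPU and on every coherent device holding the buffer, or
	 * return an error so the importer can reject the buffer.
	 */
	int (*sync_for_access)(struct dma_buf_attachment *attach,
			       enum dma_data_direction direction);
};
```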
If the Intel or ARM display drivers need non-cached memory and don't reject buffers where they don't know this, then that's certainly a bug in those drivers.
It's not just display drivers; video codec accelerators and most GPUs in this space are also non-snooping. In the ARM SoC world everyone just assumes you are non-snooping, which is why things work for most cases and only a handful, like the UVC video import, are broken.
That is really interesting to know, but I still think that DMA-buf was absolutely not designed for this use case.
From my point of view, the primary reason for this was laptops with both dedicated and integrated GPUs, webcams, etc.
That you have a huge number of ARM-specific devices which can interoperate with themselves, but not with devices outside of their domain, is not something foreseen here.
Our recollection of history might differ here, but as Daniel remarked kind of snarkily, most of the initial contributors to dma-buf were from Linaro and TI, both of which were focused on getting device interop working on ARM devices, which at the time were overwhelmingly non-snooping. So I kind of doubt that dma-buf wasn't designed for this use-case.
Regards, Lucas
Hi Christian
Am 15.02.21 um 09:58 schrieb Christian König:
Hi guys,
we are currently working on Freesync and direct scan out from system memory on AMD APUs in A+A laptops.
One problem we stumbled over is that our display hardware needs to scan out from uncached system memory and we currently don't have a way to communicate that through DMA-buf.
For our specific use case at hand we are going to implement something driver specific, but the question is should we have something more generic for this?
I had a patchset here that extends iosys-map (former dma-buf-map) with caching information. I'll post a copy.
Sorry for being late to reply.
Best regards Thomas
After all the system memory access pattern is a PCIe extension and as such something generic.
Regards, Christian. _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Am 23.06.22 um 10:13 schrieb Thomas Zimmermann:
Hi Christian
Am 15.02.21 um 09:58 schrieb Christian König:
Hi guys,
we are currently working on Freesync and direct scan out from system memory on AMD APUs in A+A laptops.
One problem we stumbled over is that our display hardware needs to scan out from uncached system memory and we currently don't have a way to communicate that through DMA-buf.
For our specific use case at hand we are going to implement something driver specific, but the question is should we have something more generic for this?
I had a patchset here that extends iosys-map (former dma-buf-map) with caching information. I'll post a copy.
Oh, nice. But why on iosys-map? We need that per DMA-buf.
Thanks, Christian.
Sorry for being late to reply.
Best regards Thomas
After all the system memory access pattern is a PCIe extension and as such something generic.
Regards, Christian.
Hi
Am 23.06.22 um 10:26 schrieb Christian König:
Am 23.06.22 um 10:13 schrieb Thomas Zimmermann:
Hi Christian
Am 15.02.21 um 09:58 schrieb Christian König:
Hi guys,
we are currently working on Freesync and direct scan out from system memory on AMD APUs in A+A laptops.
One problem we stumbled over is that our display hardware needs to scan out from uncached system memory and we currently don't have a way to communicate that through DMA-buf.
For our specific use case at hand we are going to implement something driver specific, but the question is should we have something more generic for this?
I had a patchset here that extends iosys-map (former dma-buf-map) with caching information. I'll post a copy.
Oh, nice. But why on iosys-map? We need that per DMA-buf.
It's returned via the dma-buf's vmap call within the iosys-map structure. If the dma-buf moves, the following vmap calls always return updated caching information. Maybe it's not quite what you need for Freesync?
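The mechanism described here could look roughly like the following sketch. This is illustrative only: the caching enum and field name are invented here, and the actual patchset may well differ.

```c
/* Hypothetical caching attribute carried in struct iosys_map, so an
 * importer learns from each vmap call how the current mapping is
 * cached. Names are illustrative, not from the real patchset. */
enum iosys_map_caching {
	IOSYS_MAP_CACHED,
	IOSYS_MAP_WRITECOMBINE,
	IOSYS_MAP_UNCACHED,
};

struct iosys_map {
	union {
		void __iomem *vaddr_iomem;
		void *vaddr;
	};
	bool is_iomem;
	enum iosys_map_caching caching;	/* hypothetical addition */
};

/* After a dma-buf move, a fresh vmap would return the new caching
 * mode, letting e.g. format-conversion helpers pick an access
 * pattern that is safe and fast for uncached memory. */
```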
I'll use this for format-conversion helpers, which do some optimizations for uncached memory.
Anyway, I have to look for that patch...
Best regards Thomas
Thanks, Christian.
Sorry for being late to reply.
Best regards Thomas
After all the system memory access pattern is a PCIe extension and as such something generic.
Regards, Christian.