Am 24.01.19 um 10:13 schrieb Christoph Hellwig:
On Wed, Jan 23, 2019 at 05:52:50PM +0100, Ard Biesheuvel wrote:
But my concern is that it seems likely that non-cache coherent implementations are relying on this hack as well. There must be a reason that this hack is only disabled for PowerPC platforms if they are cache coherent, for instance, and I suspect that that reason is that the hack is the only thing ensuring that the CPU mapping attributes match the device ones used for these buffers (the vmap()ed ones), whereas the rings and other consistent data structures are using the DMA API as intended, and thus getting uncached attributes in the correct way.
Dave, who added that commit is on Cc together with just about everyone involved in the review chain. Based on the previous explanation that idea what we might want an uncached mapping for some non-coherent architectures for this to work at all makes sense, but then again the way to create those mappings is entirely architecture specific, and also need a cache flushing before creating the mapping to work properly. So my working theory is that this code never properly worked on architectures without DMA coherent for PCIe at all, but I'd love to be corrected by concrete examples including an explanation of how it actually ends up working.
Cache coherency is mandatory for modern GPU operation.
Otherwise you can't implement a bunch of the requirements of the userspace APIs.
In other words the applications doesn't inform the driver that the GPU or the CPU is accessing data, it just does it and assumes that it works.
Regards, Christian.
On Thu, 24 Jan 2019 at 10:25, Koenig, Christian Christian.Koenig@amd.com wrote:
Am 24.01.19 um 10:13 schrieb Christoph Hellwig:
On Wed, Jan 23, 2019 at 05:52:50PM +0100, Ard Biesheuvel wrote:
But my concern is that it seems likely that non-cache coherent implementations are relying on this hack as well. There must be a reason that this hack is only disabled for PowerPC platforms if they are cache coherent, for instance, and I suspect that that reason is that the hack is the only thing ensuring that the CPU mapping attributes match the device ones used for these buffers (the vmap()ed ones), whereas the rings and other consistent data structures are using the DMA API as intended, and thus getting uncached attributes in the correct way.
Dave, who added that commit is on Cc together with just about everyone involved in the review chain. Based on the previous explanation that idea what we might want an uncached mapping for some non-coherent architectures for this to work at all makes sense, but then again the way to create those mappings is entirely architecture specific, and also need a cache flushing before creating the mapping to work properly. So my working theory is that this code never properly worked on architectures without DMA coherent for PCIe at all, but I'd love to be corrected by concrete examples including an explanation of how it actually ends up working.
Cache coherency is mandatory for modern GPU operation.
Otherwise you can't implement a bunch of the requirements of the userspace APIs.
In other words the applications doesn't inform the driver that the GPU or the CPU is accessing data, it just does it and assumes that it works.
Wonderful!
In that case, do you have any objections to the patch proposed by Christoph above?
Am 24.01.19 um 10:28 schrieb Ard Biesheuvel:
On Thu, 24 Jan 2019 at 10:25, Koenig, Christian Christian.Koenig@amd.com wrote:
Am 24.01.19 um 10:13 schrieb Christoph Hellwig:
On Wed, Jan 23, 2019 at 05:52:50PM +0100, Ard Biesheuvel wrote:
But my concern is that it seems likely that non-cache coherent implementations are relying on this hack as well. There must be a reason that this hack is only disabled for PowerPC platforms if they are cache coherent, for instance, and I suspect that that reason is that the hack is the only thing ensuring that the CPU mapping attributes match the device ones used for these buffers (the vmap()ed ones), whereas the rings and other consistent data structures are using the DMA API as intended, and thus getting uncached attributes in the correct way.
Dave, who added that commit is on Cc together with just about everyone involved in the review chain. Based on the previous explanation that idea what we might want an uncached mapping for some non-coherent architectures for this to work at all makes sense, but then again the way to create those mappings is entirely architecture specific, and also need a cache flushing before creating the mapping to work properly. So my working theory is that this code never properly worked on architectures without DMA coherent for PCIe at all, but I'd love to be corrected by concrete examples including an explanation of how it actually ends up working.
Cache coherency is mandatory for modern GPU operation.
Otherwise you can't implement a bunch of the requirements of the userspace APIs.
In other words the applications doesn't inform the driver that the GPU or the CPU is accessing data, it just does it and assumes that it works.
Wonderful!
In that case, do you have any objections to the patch proposed by Christoph above?
Yeah, the patch of Christoph actually goes way to far cause we have reports that this works on a bunch of other architectures.
E.g. X86 64bit, PowerPC (under some conditions) and some MIPS.
The only problematic here actually seems to be ARM, so you should probably just add an "#ifdef .._ARM return false;".
Regards, Christian.
On Thu, 24 Jan 2019 at 10:45, Koenig, Christian Christian.Koenig@amd.com wrote:
Am 24.01.19 um 10:28 schrieb Ard Biesheuvel:
On Thu, 24 Jan 2019 at 10:25, Koenig, Christian Christian.Koenig@amd.com wrote:
Am 24.01.19 um 10:13 schrieb Christoph Hellwig:
On Wed, Jan 23, 2019 at 05:52:50PM +0100, Ard Biesheuvel wrote:
But my concern is that it seems likely that non-cache coherent implementations are relying on this hack as well. There must be a reason that this hack is only disabled for PowerPC platforms if they are cache coherent, for instance, and I suspect that that reason is that the hack is the only thing ensuring that the CPU mapping attributes match the device ones used for these buffers (the vmap()ed ones), whereas the rings and other consistent data structures are using the DMA API as intended, and thus getting uncached attributes in the correct way.
Dave, who added that commit is on Cc together with just about everyone involved in the review chain. Based on the previous explanation that idea what we might want an uncached mapping for some non-coherent architectures for this to work at all makes sense, but then again the way to create those mappings is entirely architecture specific, and also need a cache flushing before creating the mapping to work properly. So my working theory is that this code never properly worked on architectures without DMA coherent for PCIe at all, but I'd love to be corrected by concrete examples including an explanation of how it actually ends up working.
Cache coherency is mandatory for modern GPU operation.
Otherwise you can't implement a bunch of the requirements of the userspace APIs.
In other words the applications doesn't inform the driver that the GPU or the CPU is accessing data, it just does it and assumes that it works.
Wonderful!
In that case, do you have any objections to the patch proposed by Christoph above?
Yeah, the patch of Christoph actually goes way to far cause we have reports that this works on a bunch of other architectures.
E.g. X86 64bit, PowerPC (under some conditions) and some MIPS.
This is *exactly* my point the whole time.
The current code has
static inline bool drm_arch_can_wc_memory(void) { #if defined(CONFIG_PPC) && !defined(CONFIG_NOT_COHERENT_CACHE) return false;
which means the optimization is disabled *unless the system is non-cache coherent*
So if you have reports that the optimization works on some PowerPC, it must be non-cache coherent PowerPC, because that is the only place where it is enabled in the first place.
The only problematic here actually seems to be ARM, so you should probably just add an "#ifdef .._ARM return false;".
ARM/arm64 does not have a Kconfig symbol like CONFIG_NOT_COHERENT_CACHE, so we can only disable it everywhere. If there are non-coherent ARM systems that are currently working in the same way as those non-coherent PowerPC systems, we will break them by doing this.
Am 24.01.19 um 10:59 schrieb Ard Biesheuvel:
[SNIP] This is *exactly* my point the whole time.
The current code has
static inline bool drm_arch_can_wc_memory(void) { #if defined(CONFIG_PPC) && !defined(CONFIG_NOT_COHERENT_CACHE) return false;
which means the optimization is disabled *unless the system is non-cache coherent*
So if you have reports that the optimization works on some PowerPC, it must be non-cache coherent PowerPC, because that is the only place where it is enabled in the first place.
The only problematic here actually seems to be ARM, so you should probably just add an "#ifdef .._ARM return false;".
ARM/arm64 does not have a Kconfig symbol like CONFIG_NOT_COHERENT_CACHE, so we can only disable it everywhere. If there are non-coherent ARM systems that are currently working in the same way as those non-coherent PowerPC systems, we will break them by doing this.
Summing the things I've read so far for ARM up I actually think it depends on a runtime configuration and not on compile time one.
So the whole idea of providing the device to the drm_*_can_wc_memory() function isn't so far fetched.
But for now I do prefer working and slightly slower system over broken one, so I think we should just disable this on ARM for now.
Christian.
On Thu, 24 Jan 2019 at 12:23, Koenig, Christian Christian.Koenig@amd.com wrote:
Am 24.01.19 um 10:59 schrieb Ard Biesheuvel:
[SNIP] This is *exactly* my point the whole time.
The current code has
static inline bool drm_arch_can_wc_memory(void) { #if defined(CONFIG_PPC) && !defined(CONFIG_NOT_COHERENT_CACHE) return false;
which means the optimization is disabled *unless the system is non-cache coherent*
So if you have reports that the optimization works on some PowerPC, it must be non-cache coherent PowerPC, because that is the only place where it is enabled in the first place.
The only problematic here actually seems to be ARM, so you should probably just add an "#ifdef .._ARM return false;".
ARM/arm64 does not have a Kconfig symbol like CONFIG_NOT_COHERENT_CACHE, so we can only disable it everywhere. If there are non-coherent ARM systems that are currently working in the same way as those non-coherent PowerPC systems, we will break them by doing this.
Summing the things I've read so far for ARM up I actually think it depends on a runtime configuration and not on compile time one.
So the whole idea of providing the device to the drm_*_can_wc_memory() function isn't so far fetched.
Thank you.
But for now I do prefer working and slightly slower system over broken one, so I think we should just disable this on ARM for now.
Again, this is not about non-cache coherent being slower without the optimization, it is about non-cache coherent likely not working *at all* unless the optimization is enabled.
Otherwise, the driver will vmap() DMA pages with cacheable attributes, while the non-cache coherent device uses uncached attributes, breaking coherency.
Am 24.01.19 um 12:26 schrieb Ard Biesheuvel:
On Thu, 24 Jan 2019 at 12:23, Koenig, Christian Christian.Koenig@amd.com wrote:
Am 24.01.19 um 10:59 schrieb Ard Biesheuvel:
[SNIP] This is *exactly* my point the whole time.
The current code has
static inline bool drm_arch_can_wc_memory(void) { #if defined(CONFIG_PPC) && !defined(CONFIG_NOT_COHERENT_CACHE) return false;
which means the optimization is disabled *unless the system is non-cache coherent*
So if you have reports that the optimization works on some PowerPC, it must be non-cache coherent PowerPC, because that is the only place where it is enabled in the first place.
The only problematic here actually seems to be ARM, so you should probably just add an "#ifdef .._ARM return false;".
ARM/arm64 does not have a Kconfig symbol like CONFIG_NOT_COHERENT_CACHE, so we can only disable it everywhere. If there are non-coherent ARM systems that are currently working in the same way as those non-coherent PowerPC systems, we will break them by doing this.
Summing the things I've read so far for ARM up I actually think it depends on a runtime configuration and not on compile time one.
So the whole idea of providing the device to the drm_*_can_wc_memory() function isn't so far fetched.
Thank you.
But for now I do prefer working and slightly slower system over broken one, so I think we should just disable this on ARM for now.
Again, this is not about non-cache coherent being slower without the optimization, it is about non-cache coherent likely not working *at all* unless the optimization is enabled.
As Michel tried to explain this CAN'T happen. The optimization is a purely optional request from userspace.
Otherwise, the driver will vmap() DMA pages with cacheable attributes, while the non-cache coherent device uses uncached attributes, breaking coherency.
Again this is mandated by the userspace APIs anyway. E.g. we can't vmap() pages in any other way or our userspace APIs would break.
Regards, Christian.
On Thu, 24 Jan 2019 at 12:37, Koenig, Christian Christian.Koenig@amd.com wrote:
Am 24.01.19 um 12:26 schrieb Ard Biesheuvel:
On Thu, 24 Jan 2019 at 12:23, Koenig, Christian Christian.Koenig@amd.com wrote:
Am 24.01.19 um 10:59 schrieb Ard Biesheuvel:
[SNIP] This is *exactly* my point the whole time.
The current code has
static inline bool drm_arch_can_wc_memory(void) { #if defined(CONFIG_PPC) && !defined(CONFIG_NOT_COHERENT_CACHE) return false;
which means the optimization is disabled *unless the system is non-cache coherent*
So if you have reports that the optimization works on some PowerPC, it must be non-cache coherent PowerPC, because that is the only place where it is enabled in the first place.
The only problematic here actually seems to be ARM, so you should probably just add an "#ifdef .._ARM return false;".
ARM/arm64 does not have a Kconfig symbol like CONFIG_NOT_COHERENT_CACHE, so we can only disable it everywhere. If there are non-coherent ARM systems that are currently working in the same way as those non-coherent PowerPC systems, we will break them by doing this.
Summing the things I've read so far for ARM up I actually think it depends on a runtime configuration and not on compile time one.
So the whole idea of providing the device to the drm_*_can_wc_memory() function isn't so far fetched.
Thank you.
But for now I do prefer working and slightly slower system over broken one, so I think we should just disable this on ARM for now.
Again, this is not about non-cache coherent being slower without the optimization, it is about non-cache coherent likely not working *at all* unless the optimization is enabled.
As Michel tried to explain this CAN'T happen. The optimization is a purely optional request from userspace.
Right.
So in that case, we can assume that the following test
static inline bool drm_arch_can_wc_memory(void) { #if defined(CONFIG_PPC) && !defined(CONFIG_NOT_COHERENT_CACHE) return false;
is bogus, and it was just unnecessary caution on the part of the author to disregard non-cache coherent devices. Unfortunately, those commits have no log messages whatsoever, so it is difficult to infer the intent retroactively.
Otherwise, the driver will vmap() DMA pages with cacheable attributes, while the non-cache coherent device uses uncached attributes, breaking coherency.
Again this is mandated by the userspace APIs anyway. E.g. we can't vmap() pages in any other way or our userspace APIs would break.
OK,
So let's just disable this for all ARM and arm64 then, given that non-cache coherent is not supported in any case
On Thu, Jan 24, 2019 at 6:45 AM Ard Biesheuvel ard.biesheuvel@linaro.org wrote:
On Thu, 24 Jan 2019 at 12:37, Koenig, Christian Christian.Koenig@amd.com wrote:
Am 24.01.19 um 12:26 schrieb Ard Biesheuvel:
On Thu, 24 Jan 2019 at 12:23, Koenig, Christian Christian.Koenig@amd.com wrote:
Am 24.01.19 um 10:59 schrieb Ard Biesheuvel:
[SNIP] This is *exactly* my point the whole time.
The current code has
static inline bool drm_arch_can_wc_memory(void) { #if defined(CONFIG_PPC) && !defined(CONFIG_NOT_COHERENT_CACHE) return false;
which means the optimization is disabled *unless the system is non-cache coherent*
So if you have reports that the optimization works on some PowerPC, it must be non-cache coherent PowerPC, because that is the only place where it is enabled in the first place.
The only problematic here actually seems to be ARM, so you should probably just add an "#ifdef .._ARM return false;".
ARM/arm64 does not have a Kconfig symbol like CONFIG_NOT_COHERENT_CACHE, so we can only disable it everywhere. If there are non-coherent ARM systems that are currently working in the same way as those non-coherent PowerPC systems, we will break them by doing this.
Summing the things I've read so far for ARM up I actually think it depends on a runtime configuration and not on compile time one.
So the whole idea of providing the device to the drm_*_can_wc_memory() function isn't so far fetched.
Thank you.
But for now I do prefer working and slightly slower system over broken one, so I think we should just disable this on ARM for now.
Again, this is not about non-cache coherent being slower without the optimization, it is about non-cache coherent likely not working *at all* unless the optimization is enabled.
As Michel tried to explain this CAN'T happen. The optimization is a purely optional request from userspace.
Right.
So in that case, we can assume that the following test
static inline bool drm_arch_can_wc_memory(void) { #if defined(CONFIG_PPC) && !defined(CONFIG_NOT_COHERENT_CACHE) return false;
is bogus, and it was just unnecessary caution on the part of the author to disregard non-cache coherent devices. Unfortunately, those commits have no log messages whatsoever, so it is difficult to infer the intent retroactively.
Otherwise, the driver will vmap() DMA pages with cacheable attributes, while the non-cache coherent device uses uncached attributes, breaking coherency.
Again this is mandated by the userspace APIs anyway. E.g. we can't vmap() pages in any other way or our userspace APIs would break.
OK,
So let's just disable this for all ARM and arm64 then, given that non-cache coherent is not supported in any case
So I think we are back to this patch: https://patchwork.kernel.org/patch/10739023/
Alex
On Thu, 24 Jan 2019 at 14:54, Alex Deucher alexdeucher@gmail.com wrote:
On Thu, Jan 24, 2019 at 6:45 AM Ard Biesheuvel ard.biesheuvel@linaro.org wrote:
On Thu, 24 Jan 2019 at 12:37, Koenig, Christian Christian.Koenig@amd.com wrote:
Am 24.01.19 um 12:26 schrieb Ard Biesheuvel:
On Thu, 24 Jan 2019 at 12:23, Koenig, Christian Christian.Koenig@amd.com wrote:
Am 24.01.19 um 10:59 schrieb Ard Biesheuvel:
[SNIP] This is *exactly* my point the whole time.
The current code has
static inline bool drm_arch_can_wc_memory(void) { #if defined(CONFIG_PPC) && !defined(CONFIG_NOT_COHERENT_CACHE) return false;
which means the optimization is disabled *unless the system is non-cache coherent*
So if you have reports that the optimization works on some PowerPC, it must be non-cache coherent PowerPC, because that is the only place where it is enabled in the first place.
> The only problematic here actually seems to be ARM, so you should > probably just add an "#ifdef .._ARM return false;". > ARM/arm64 does not have a Kconfig symbol like CONFIG_NOT_COHERENT_CACHE, so we can only disable it everywhere. If there are non-coherent ARM systems that are currently working in the same way as those non-coherent PowerPC systems, we will break them by doing this.
Summing the things I've read so far for ARM up I actually think it depends on a runtime configuration and not on compile time one.
So the whole idea of providing the device to the drm_*_can_wc_memory() function isn't so far fetched.
Thank you.
But for now I do prefer working and slightly slower system over broken one, so I think we should just disable this on ARM for now.
Again, this is not about non-cache coherent being slower without the optimization, it is about non-cache coherent likely not working *at all* unless the optimization is enabled.
As Michel tried to explain this CAN'T happen. The optimization is a purely optional request from userspace.
Right.
So in that case, we can assume that the following test
static inline bool drm_arch_can_wc_memory(void) { #if defined(CONFIG_PPC) && !defined(CONFIG_NOT_COHERENT_CACHE) return false;
is bogus, and it was just unnecessary caution on the part of the author to disregard non-cache coherent devices. Unfortunately, those commits have no log messages whatsoever, so it is difficult to infer the intent retroactively.
Otherwise, the driver will vmap() DMA pages with cacheable attributes, while the non-cache coherent device uses uncached attributes, breaking coherency.
Again this is mandated by the userspace APIs anyway. E.g. we can't vmap() pages in any other way or our userspace APIs would break.
OK,
So let's just disable this for all ARM and arm64 then, given that non-cache coherent is not supported in any case
So I think we are back to this patch: https://patchwork.kernel.org/patch/10739023/
Apart from the fact that the issue has nothing to do with write-combining, yes.
On Thu, Jan 24, 2019 at 8:57 AM Ard Biesheuvel ard.biesheuvel@linaro.org wrote:
On Thu, 24 Jan 2019 at 14:54, Alex Deucher alexdeucher@gmail.com wrote:
On Thu, Jan 24, 2019 at 6:45 AM Ard Biesheuvel ard.biesheuvel@linaro.org wrote:
On Thu, 24 Jan 2019 at 12:37, Koenig, Christian Christian.Koenig@amd.com wrote:
Am 24.01.19 um 12:26 schrieb Ard Biesheuvel:
On Thu, 24 Jan 2019 at 12:23, Koenig, Christian Christian.Koenig@amd.com wrote:
Am 24.01.19 um 10:59 schrieb Ard Biesheuvel: > [SNIP] > This is *exactly* my point the whole time. > > The current code has > > static inline bool drm_arch_can_wc_memory(void) > { > #if defined(CONFIG_PPC) && !defined(CONFIG_NOT_COHERENT_CACHE) > return false; > > which means the optimization is disabled *unless the system is > non-cache coherent* > > So if you have reports that the optimization works on some PowerPC, it > must be non-cache coherent PowerPC, because that is the only place > where it is enabled in the first place. > >> The only problematic here actually seems to be ARM, so you should >> probably just add an "#ifdef .._ARM return false;". >> > ARM/arm64 does not have a Kconfig symbol like > CONFIG_NOT_COHERENT_CACHE, so we can only disable it everywhere. If > there are non-coherent ARM systems that are currently working in the > same way as those non-coherent PowerPC systems, we will break them by > doing this. Summing the things I've read so far for ARM up I actually think it depends on a runtime configuration and not on compile time one.
So the whole idea of providing the device to the drm_*_can_wc_memory() function isn't so far fetched.
Thank you.
But for now I do prefer working and slightly slower system over broken one, so I think we should just disable this on ARM for now.
Again, this is not about non-cache coherent being slower without the optimization, it is about non-cache coherent likely not working *at all* unless the optimization is enabled.
As Michel tried to explain this CAN'T happen. The optimization is a purely optional request from userspace.
Right.
So in that case, we can assume that the following test
static inline bool drm_arch_can_wc_memory(void) { #if defined(CONFIG_PPC) && !defined(CONFIG_NOT_COHERENT_CACHE) return false;
is bogus, and it was just unnecessary caution on the part of the author to disregard non-cache coherent devices. Unfortunately, those commits have no log messages whatsoever, so it is difficult to infer the intent retroactively.
Otherwise, the driver will vmap() DMA pages with cacheable attributes, while the non-cache coherent device uses uncached attributes, breaking coherency.
Again this is mandated by the userspace APIs anyway. E.g. we can't vmap() pages in any other way or our userspace APIs would break.
OK,
So let's just disable this for all ARM and arm64 then, given that non-cache coherent is not supported in any case
So I think we are back to this patch: https://patchwork.kernel.org/patch/10739023/
Apart from the fact that the issue has nothing to do with write-combining, yes.
Your patch has a better description. Let's go with that.
Alex
On 2019-01-24 12:45 p.m., Ard Biesheuvel wrote:
On Thu, 24 Jan 2019 at 12:37, Koenig, Christian Christian.Koenig@amd.com wrote:
Am 24.01.19 um 12:26 schrieb Ard Biesheuvel:
On Thu, 24 Jan 2019 at 12:23, Koenig, Christian Christian.Koenig@amd.com wrote:
Am 24.01.19 um 10:59 schrieb Ard Biesheuvel:
[SNIP] This is *exactly* my point the whole time.
The current code has
static inline bool drm_arch_can_wc_memory(void) { #if defined(CONFIG_PPC) && !defined(CONFIG_NOT_COHERENT_CACHE) return false;
which means the optimization is disabled *unless the system is non-cache coherent*
So if you have reports that the optimization works on some PowerPC, it must be non-cache coherent PowerPC, because that is the only place where it is enabled in the first place.
The only problematic here actually seems to be ARM, so you should probably just add an "#ifdef .._ARM return false;".
ARM/arm64 does not have a Kconfig symbol like CONFIG_NOT_COHERENT_CACHE, so we can only disable it everywhere. If there are non-coherent ARM systems that are currently working in the same way as those non-coherent PowerPC systems, we will break them by doing this.
Summing the things I've read so far for ARM up I actually think it depends on a runtime configuration and not on compile time one.
So the whole idea of providing the device to the drm_*_can_wc_memory() function isn't so far fetched.
Thank you.
But for now I do prefer working and slightly slower system over broken one, so I think we should just disable this on ARM for now.
Again, this is not about non-cache coherent being slower without the optimization, it is about non-cache coherent likely not working *at all* unless the optimization is enabled.
As Michel tried to explain this CAN'T happen. The optimization is a purely optional request from userspace.
Right.
So in that case, we can assume that the following test
static inline bool drm_arch_can_wc_memory(void) { #if defined(CONFIG_PPC) && !defined(CONFIG_NOT_COHERENT_CACHE) return false;
is bogus, and it was just unnecessary caution on the part of the author to disregard non-cache coherent devices.
This is driver-independent core code, meaning "non-snooped PCIe transfers don't work on cache coherent PPC". It doesn't imply anything about whether or not amdgpu or any other driver works on non-cache-coherent PPC in general.
dri-devel@lists.freedesktop.org