Hi all,
I polished up my original proof-of-concept a little while back, but now that I've got my hands on my Juno again I've been able to actually test it to my satisfaction, so here are proper patches!
It probably makes sense for patches #1 and #2 to stay together and both go via drm-misc, provided Will's OK with that.
Robin.
Robin Murphy (3): iommu/io-pgtable-arm: Support coherency for Mali LPAE drm/panfrost: Support cache-coherent integrations arm64: dts: meson: Describe G12b GPU as coherent
arch/arm64/boot/dts/amlogic/meson-g12b.dtsi | 4 ++++ drivers/gpu/drm/panfrost/panfrost_device.h | 1 + drivers/gpu/drm/panfrost/panfrost_drv.c | 2 ++ drivers/gpu/drm/panfrost/panfrost_gem.c | 2 ++ drivers/gpu/drm/panfrost/panfrost_mmu.c | 1 + drivers/iommu/io-pgtable-arm.c | 5 ++++- 6 files changed, 14 insertions(+), 1 deletion(-)
Midgard GPUs have ACE-Lite master interfaces which allows systems to integrate them in an I/O-coherent manner. It seems that from the GPU's viewpoint, the rest of the system is its outer shareable domain, and so even when snoop signals are wired up, they are only emitted for outer shareable accesses. As such, setting the TTBR_SHARE_OUTER bit does indeed get coherent pagetable walks working nicely for the coherent T620 in the Arm Juno SoC.
Reviewed-by: Steven Price steven.price@arm.com Signed-off-by: Robin Murphy robin.murphy@arm.com --- drivers/iommu/io-pgtable-arm.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c index dc7bcf858b6d..e47012006dcc 100644 --- a/drivers/iommu/io-pgtable-arm.c +++ b/drivers/iommu/io-pgtable-arm.c @@ -440,7 +440,7 @@ static arm_lpae_iopte arm_lpae_prot_to_pte(struct arm_lpae_io_pgtable *data, << ARM_LPAE_PTE_ATTRINDX_SHIFT); }
- if (prot & IOMMU_CACHE) + if (prot & IOMMU_CACHE && data->iop.fmt != ARM_MALI_LPAE) pte |= ARM_LPAE_PTE_SH_IS; else pte |= ARM_LPAE_PTE_SH_OS; @@ -1049,6 +1049,9 @@ arm_mali_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg, void *cookie) cfg->arm_mali_lpae_cfg.transtab = virt_to_phys(data->pgd) | ARM_MALI_LPAE_TTBR_READ_INNER | ARM_MALI_LPAE_TTBR_ADRMODE_TABLE; + if (cfg->coherent_walk) + cfg->arm_mali_lpae_cfg.transtab |= ARM_MALI_LPAE_TTBR_SHARE_OUTER; + return &data->iop;
out_free_data:
On Wed, Sep 16, 2020 at 12:51:05AM +0100, Robin Murphy wrote:
Midgard GPUs have ACE-Lite master interfaces which allows systems to integrate them in an I/O-coherent manner. It seems that from the GPU's viewpoint, the rest of the system is its outer shareable domain, and so even when snoop signals are wired up, they are only emitted for outer shareable accesses. As such, setting the TTBR_SHARE_OUTER bit does indeed get coherent pagetable walks working nicely for the coherent T620 in the Arm Juno SoC.
I can't help but think some of this commentary deserves to be in the code as well.
Do you know if this sort of thing is done for other SoCs too, or is this just a Juno quirk?
Will
On 2020-09-21 18:57, Will Deacon wrote:
On Wed, Sep 16, 2020 at 12:51:05AM +0100, Robin Murphy wrote:
Midgard GPUs have ACE-Lite master interfaces which allows systems to integrate them in an I/O-coherent manner. It seems that from the GPU's viewpoint, the rest of the system is its outer shareable domain, and so even when snoop signals are wired up, they are only emitted for outer shareable accesses. As such, setting the TTBR_SHARE_OUTER bit does indeed get coherent pagetable walks working nicely for the coherent T620 in the Arm Juno SoC.
I can't help but think some of this commentary deserves to be in the code as well.
Sure, if you want.
Do you know if this sort of thing is done for other SoCs too, or is this just a Juno quirk?
Yup, this is a "Midgard working as designed" thing. Juno is the coherent example I have to hand, but off the top of my head I believe some of the Exynos SoCs can also use their GPUs coherently if a switch is flipped in the interconnect to change routing between the CCI and a direct-to-RAM path; I expect there are probably further Midgard examples that I'm not aware of. Then there are definitely coherent Bifrost GPUs like the Amlogic S922/A311 that prompted me to revive this patch, which we currently drive in "Legacy" mode and thus behave the same way as Midgard (Bifrost's "AArch64" mode realigns Ish and Osh with the rest of the system, and instead invents a new "Internal Shareable" value in between Nsh and Ish to represent the shareability between cores within the GPU for which Midgard hijacked Ish).
Robin.
On Mon, Sep 21, 2020 at 10:53:23PM +0100, Robin Murphy wrote:
On 2020-09-21 18:57, Will Deacon wrote:
On Wed, Sep 16, 2020 at 12:51:05AM +0100, Robin Murphy wrote:
Midgard GPUs have ACE-Lite master interfaces which allows systems to integrate them in an I/O-coherent manner. It seems that from the GPU's viewpoint, the rest of the system is its outer shareable domain, and so even when snoop signals are wired up, they are only emitted for outer shareable accesses. As such, setting the TTBR_SHARE_OUTER bit does indeed get coherent pagetable walks working nicely for the coherent T620 in the Arm Juno SoC.
I can't help but think some of this commentary deserves to be in the code as well.
Sure, if you want.
Yes, please.
Do you know if this sort of thing is done for other SoCs too, or is this just a Juno quirk?
Yup, this is a "Midgard working as designed" thing. Juno is the coherent example I have to hand, but off the top of my head I believe some of the Exynos SoCs can also use their GPUs coherently if a switch is flipped in the interconnect to change routing between the CCI and a direct-to-RAM path; I expect there are probably further Midgard examples that I'm not aware of. Then there are definitely coherent Bifrost GPUs like the Amlogic S922/A311 that prompted me to revive this patch, which we currently drive in "Legacy" mode and thus behave the same way as Midgard (Bifrost's "AArch64" mode realigns Ish and Osh with the rest of the system, and instead invents a new "Internal Shareable" value in between Nsh and Ish to represent the shareability between cores within the GPU for which Midgard hijacked Ish).
That is more than I wanted to know :) "Internal Shareable", jeez...
Thanks,
Will
When the GPU's ACE-Lite interface is fully wired up and capable of snooping CPU caches, it may be described as "dma-coherent" in devicetree, which will already inform the DMA layer not to perform unnecessary cache maintenance. However, we still need to ensure that the GPU uses the appropriate cacheable outer-shareable attributes in order to generate the requisite snoop signals, and that CPU mappings don't create a mismatch by using a non-cacheable type either.
Signed-off-by: Robin Murphy robin.murphy@arm.com --- drivers/gpu/drm/panfrost/panfrost_device.h | 1 + drivers/gpu/drm/panfrost/panfrost_drv.c | 2 ++ drivers/gpu/drm/panfrost/panfrost_gem.c | 2 ++ drivers/gpu/drm/panfrost/panfrost_mmu.c | 1 + 4 files changed, 6 insertions(+)
diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h b/drivers/gpu/drm/panfrost/panfrost_device.h index c30c719a8059..b31f45315e96 100644 --- a/drivers/gpu/drm/panfrost/panfrost_device.h +++ b/drivers/gpu/drm/panfrost/panfrost_device.h @@ -84,6 +84,7 @@ struct panfrost_device { /* pm_domains for devices with more than one. */ struct device *pm_domain_devs[MAX_PM_DOMAINS]; struct device_link *pm_domain_links[MAX_PM_DOMAINS]; + bool coherent;
struct panfrost_features features; const struct panfrost_compatible *comp; diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c b/drivers/gpu/drm/panfrost/panfrost_drv.c index ada51df9a7a3..2a6f2f716b2f 100644 --- a/drivers/gpu/drm/panfrost/panfrost_drv.c +++ b/drivers/gpu/drm/panfrost/panfrost_drv.c @@ -588,6 +588,8 @@ static int panfrost_probe(struct platform_device *pdev) if (!pfdev->comp) return -ENODEV;
+ pfdev->coherent = device_get_dma_attr(&pdev->dev) == DEV_DMA_COHERENT; + /* Allocate and initialze the DRM device. */ ddev = drm_dev_alloc(&panfrost_drm_driver, &pdev->dev); if (IS_ERR(ddev)) diff --git a/drivers/gpu/drm/panfrost/panfrost_gem.c b/drivers/gpu/drm/panfrost/panfrost_gem.c index 33355dd302f1..cdf1a8754eba 100644 --- a/drivers/gpu/drm/panfrost/panfrost_gem.c +++ b/drivers/gpu/drm/panfrost/panfrost_gem.c @@ -220,6 +220,7 @@ static const struct drm_gem_object_funcs panfrost_gem_funcs = { */ struct drm_gem_object *panfrost_gem_create_object(struct drm_device *dev, size_t size) { + struct panfrost_device *pfdev = dev->dev_private; struct panfrost_gem_object *obj;
obj = kzalloc(sizeof(*obj), GFP_KERNEL); @@ -229,6 +230,7 @@ struct drm_gem_object *panfrost_gem_create_object(struct drm_device *dev, size_t INIT_LIST_HEAD(&obj->mappings.list); mutex_init(&obj->mappings.lock); obj->base.base.funcs = &panfrost_gem_funcs; + obj->base.map_cached = pfdev->coherent;
return &obj->base.base; } diff --git a/drivers/gpu/drm/panfrost/panfrost_mmu.c b/drivers/gpu/drm/panfrost/panfrost_mmu.c index e8f7b11352d2..8852fd378f7a 100644 --- a/drivers/gpu/drm/panfrost/panfrost_mmu.c +++ b/drivers/gpu/drm/panfrost/panfrost_mmu.c @@ -371,6 +371,7 @@ int panfrost_mmu_pgtable_alloc(struct panfrost_file_priv *priv) .pgsize_bitmap = SZ_4K | SZ_2M, .ias = FIELD_GET(0xff, pfdev->features.mmu_features), .oas = FIELD_GET(0xff00, pfdev->features.mmu_features), + .coherent_walk = pfdev->coherent, .tlb = &mmu_tlb_ops, .iommu_dev = pfdev->dev, };
On 16/09/2020 00:51, Robin Murphy wrote:
When the GPU's ACE-Lite interface is fully wired up and capable of snooping CPU caches, it may be described as "dma-coherent" in devicetree, which will already inform the DMA layer not to perform unnecessary cache maintenance. However, we still need to ensure that the GPU uses the appropriate cacheable outer-shareable attributes in order to generate the requisite snoop signals, and that CPU mappings don't create a mismatch by using a non-cacheable type either.
Signed-off-by: Robin Murphy robin.murphy@arm.com
LGTM
Reviewed-by: Steven Price steven.price@arm.com
drivers/gpu/drm/panfrost/panfrost_device.h | 1 + drivers/gpu/drm/panfrost/panfrost_drv.c | 2 ++ drivers/gpu/drm/panfrost/panfrost_gem.c | 2 ++ drivers/gpu/drm/panfrost/panfrost_mmu.c | 1 + 4 files changed, 6 insertions(+)
diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h b/drivers/gpu/drm/panfrost/panfrost_device.h index c30c719a8059..b31f45315e96 100644 --- a/drivers/gpu/drm/panfrost/panfrost_device.h +++ b/drivers/gpu/drm/panfrost/panfrost_device.h @@ -84,6 +84,7 @@ struct panfrost_device { /* pm_domains for devices with more than one. */ struct device *pm_domain_devs[MAX_PM_DOMAINS]; struct device_link *pm_domain_links[MAX_PM_DOMAINS];
bool coherent;
struct panfrost_features features; const struct panfrost_compatible *comp;
diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c b/drivers/gpu/drm/panfrost/panfrost_drv.c index ada51df9a7a3..2a6f2f716b2f 100644 --- a/drivers/gpu/drm/panfrost/panfrost_drv.c +++ b/drivers/gpu/drm/panfrost/panfrost_drv.c @@ -588,6 +588,8 @@ static int panfrost_probe(struct platform_device *pdev) if (!pfdev->comp) return -ENODEV;
- pfdev->coherent = device_get_dma_attr(&pdev->dev) == DEV_DMA_COHERENT;
- /* Allocate and initialze the DRM device. */ ddev = drm_dev_alloc(&panfrost_drm_driver, &pdev->dev); if (IS_ERR(ddev))
diff --git a/drivers/gpu/drm/panfrost/panfrost_gem.c b/drivers/gpu/drm/panfrost/panfrost_gem.c index 33355dd302f1..cdf1a8754eba 100644 --- a/drivers/gpu/drm/panfrost/panfrost_gem.c +++ b/drivers/gpu/drm/panfrost/panfrost_gem.c @@ -220,6 +220,7 @@ static const struct drm_gem_object_funcs panfrost_gem_funcs = { */ struct drm_gem_object *panfrost_gem_create_object(struct drm_device *dev, size_t size) {
struct panfrost_device *pfdev = dev->dev_private; struct panfrost_gem_object *obj;
obj = kzalloc(sizeof(*obj), GFP_KERNEL);
@@ -229,6 +230,7 @@ struct drm_gem_object *panfrost_gem_create_object(struct drm_device *dev, size_t INIT_LIST_HEAD(&obj->mappings.list); mutex_init(&obj->mappings.lock); obj->base.base.funcs = &panfrost_gem_funcs;
obj->base.map_cached = pfdev->coherent;
return &obj->base.base; }
diff --git a/drivers/gpu/drm/panfrost/panfrost_mmu.c b/drivers/gpu/drm/panfrost/panfrost_mmu.c index e8f7b11352d2..8852fd378f7a 100644 --- a/drivers/gpu/drm/panfrost/panfrost_mmu.c +++ b/drivers/gpu/drm/panfrost/panfrost_mmu.c @@ -371,6 +371,7 @@ int panfrost_mmu_pgtable_alloc(struct panfrost_file_priv *priv) .pgsize_bitmap = SZ_4K | SZ_2M, .ias = FIELD_GET(0xff, pfdev->features.mmu_features), .oas = FIELD_GET(0xff00, pfdev->features.mmu_features),
.tlb = &mmu_tlb_ops, .iommu_dev = pfdev->dev, };.coherent_walk = pfdev->coherent,
According to a downstream commit I found in the Khadas vendor kernel, the GPU on G12b is wired up for ACE-lite, so (now that Panfrost knows how to handle this properly) we should describe it as such. Otherwise the mismatch leads to all manner of fun with mismatched attributes and inadvertently snooping stale data from caches, which would account for at least some of the brokenness observed on this platform.
Signed-off-by: Robin Murphy robin.murphy@arm.com --- arch/arm64/boot/dts/amlogic/meson-g12b.dtsi | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi b/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi index 9b8548e5f6e5..ee8fcae9f9f0 100644 --- a/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi +++ b/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi @@ -135,3 +135,7 @@ map1 { }; }; }; + +&mali { + dma-coherent; +};
Hi Robin,
On 16/09/2020 01:51, Robin Murphy wrote:
According to a downstream commit I found in the Khadas vendor kernel, the GPU on G12b is wired up for ACE-lite, so (now that Panfrost knows how to handle this properly) we should describe it as such. Otherwise the mismatch leads to all manner of fun with mismatched attributes and inadvertently snooping stale data from caches, which would account for at least some of the brokenness observed on this platform.
Signed-off-by: Robin Murphy robin.murphy@arm.com
arch/arm64/boot/dts/amlogic/meson-g12b.dtsi | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi b/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi index 9b8548e5f6e5..ee8fcae9f9f0 100644 --- a/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi +++ b/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi @@ -135,3 +135,7 @@ map1 { }; }; };
+&mali {
- dma-coherent;
+};
Thanks a lot for digging, I'll run a test to confirm it fixes the issue !
Neil
Hi Robin, Neil,
On Wed, 16 Sep 2020 10:26:43 +0200 Neil Armstrong narmstrong@baylibre.com wrote:
Hi Robin,
On 16/09/2020 01:51, Robin Murphy wrote:
According to a downstream commit I found in the Khadas vendor kernel, the GPU on G12b is wired up for ACE-lite, so (now that Panfrost knows how to handle this properly) we should describe it as such. Otherwise the mismatch leads to all manner of fun with mismatched attributes and inadvertently snooping stale data from caches, which would account for at least some of the brokenness observed on this platform.
Signed-off-by: Robin Murphy robin.murphy@arm.com
arch/arm64/boot/dts/amlogic/meson-g12b.dtsi | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi b/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi index 9b8548e5f6e5..ee8fcae9f9f0 100644 --- a/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi +++ b/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi @@ -135,3 +135,7 @@ map1 { }; }; };
+&mali {
- dma-coherent;
+};
Thanks a lot for digging, I'll run a test to confirm it fixes the issue !
Sorry for the late reply. I triggered a dEQP run with this patch applied and I see a bunch of "panfrost ffe40000.gpu: matching BO is not heap type" errors (see below for a full backtrace). That doesn't seem to happen when we drop this dma-coherent property.
[ 690.945731] ------------[ cut here ]------------ [ 690.950003] panfrost ffe40000.gpu: matching BO is not heap type (GPU VA = 319a000) [ 690.950051] WARNING: CPU: 0 PID: 120 at drivers/gpu/drm/panfrost/panfrost_mmu.c:465 panfrost_mmu_irq_handler_thread+0x47c/0x650 [ 690.968854] Modules linked in: [ 690.971878] CPU: 0 PID: 120 Comm: irq/27-panfrost Tainted: G W 5.9.0-rc5-02434-g7d8109ec5a42 #784 [ 690.981964] Hardware name: Khadas VIM3 (DT) [ 690.986107] pstate: 60000005 (nZCv daif -PAN -UAO BTYPE=--) [ 690.991627] pc : panfrost_mmu_irq_handler_thread+0x47c/0x650 [ 690.997232] lr : panfrost_mmu_irq_handler_thread+0x47c/0x650 [ 691.002836] sp : ffff800011bcbcd0 [ 691.006114] x29: ffff800011bcbcf0 x28: ffff0000f3fe3800 [ 691.011375] x27: ffff0000ceaf5350 x26: ffff0000ca5fc500 [ 691.016636] x25: ffff0000f32409c0 x24: 0000000000000001 [ 691.021897] x23: ffff0000f3240880 x22: ffff0000f3e3a800 [ 691.027159] x21: 0000000000000000 x20: 0000000000000000 [ 691.032420] x19: 0000000000010001 x18: 0000000000000020 [ 691.037681] x17: 0000000000000000 x16: 0000000000000000 [ 691.042942] x15: ffff0000f3fe3c70 x14: ffffffffffffffff [ 691.048204] x13: ffff8000116c2428 x12: ffff8000116c2086 [ 691.053466] x11: ffff800011bcbcd0 x10: ffff800011bcbcd0 [ 691.058727] x9 : 00000000fffffffe x8 : 0000000000000000 [ 691.063988] x7 : 7420706165682074 x6 : ffff8000116c1816 [ 691.069249] x5 : 0000000000000000 x4 : 0000000000000000 [ 691.074510] x3 : 00000000ffffffff x2 : ffff8000e348c000 [ 691.079771] x1 : f1b91ff9af2df000 x0 : 0000000000000000 [ 691.085033] Call trace: [ 691.087452] panfrost_mmu_irq_handler_thread+0x47c/0x650 [ 691.092712] irq_thread_fn+0x2c/0xa0 [ 691.096246] irq_thread+0x184/0x208 [ 691.099699] kthread+0x140/0x160 [ 691.102890] ret_from_fork+0x10/0x34 [ 691.106424] ---[ end trace b5dd8c2dfada8236 ]---
On 05/10/2020 09:15, Boris Brezillon wrote:
Hi Robin, Neil,
On Wed, 16 Sep 2020 10:26:43 +0200 Neil Armstrong narmstrong@baylibre.com wrote:
Hi Robin,
On 16/09/2020 01:51, Robin Murphy wrote:
According to a downstream commit I found in the Khadas vendor kernel, the GPU on G12b is wired up for ACE-lite, so (now that Panfrost knows how to handle this properly) we should describe it as such. Otherwise the mismatch leads to all manner of fun with mismatched attributes and inadvertently snooping stale data from caches, which would account for at least some of the brokenness observed on this platform.
Signed-off-by: Robin Murphy robin.murphy@arm.com
arch/arm64/boot/dts/amlogic/meson-g12b.dtsi | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi b/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi index 9b8548e5f6e5..ee8fcae9f9f0 100644 --- a/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi +++ b/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi @@ -135,3 +135,7 @@ map1 { }; }; };
+&mali {
- dma-coherent;
+};
Thanks a lot for digging, I'll run a test to confirm it fixes the issue !
Sorry for the late reply. I triggered a dEQP run with this patch applied and I see a bunch of "panfrost ffe40000.gpu: matching BO is not heap type" errors (see below for a full backtrace). That doesn't seem to happen when we drop this dma-coherent property.
[ 690.945731] ------------[ cut here ]------------ [ 690.950003] panfrost ffe40000.gpu: matching BO is not heap type (GPU VA = 319a000) [ 690.950051] WARNING: CPU: 0 PID: 120 at drivers/gpu/drm/panfrost/panfrost_mmu.c:465 panfrost_mmu_irq_handler_thread+0x47c/0x650 [ 690.968854] Modules linked in: [ 690.971878] CPU: 0 PID: 120 Comm: irq/27-panfrost Tainted: G W 5.9.0-rc5-02434-g7d8109ec5a42 #784 [ 690.981964] Hardware name: Khadas VIM3 (DT) [ 690.986107] pstate: 60000005 (nZCv daif -PAN -UAO BTYPE=--) [ 690.991627] pc : panfrost_mmu_irq_handler_thread+0x47c/0x650 [ 690.997232] lr : panfrost_mmu_irq_handler_thread+0x47c/0x650 [ 691.002836] sp : ffff800011bcbcd0 [ 691.006114] x29: ffff800011bcbcf0 x28: ffff0000f3fe3800 [ 691.011375] x27: ffff0000ceaf5350 x26: ffff0000ca5fc500 [ 691.016636] x25: ffff0000f32409c0 x24: 0000000000000001 [ 691.021897] x23: ffff0000f3240880 x22: ffff0000f3e3a800 [ 691.027159] x21: 0000000000000000 x20: 0000000000000000 [ 691.032420] x19: 0000000000010001 x18: 0000000000000020 [ 691.037681] x17: 0000000000000000 x16: 0000000000000000 [ 691.042942] x15: ffff0000f3fe3c70 x14: ffffffffffffffff [ 691.048204] x13: ffff8000116c2428 x12: ffff8000116c2086 [ 691.053466] x11: ffff800011bcbcd0 x10: ffff800011bcbcd0 [ 691.058727] x9 : 00000000fffffffe x8 : 0000000000000000 [ 691.063988] x7 : 7420706165682074 x6 : ffff8000116c1816 [ 691.069249] x5 : 0000000000000000 x4 : 0000000000000000 [ 691.074510] x3 : 00000000ffffffff x2 : ffff8000e348c000 [ 691.079771] x1 : f1b91ff9af2df000 x0 : 0000000000000000 [ 691.085033] Call trace: [ 691.087452] panfrost_mmu_irq_handler_thread+0x47c/0x650 [ 691.092712] irq_thread_fn+0x2c/0xa0 [ 691.096246] irq_thread+0x184/0x208 [ 691.099699] kthread+0x140/0x160 [ 691.102890] ret_from_fork+0x10/0x34 [ 691.106424] ---[ end trace b5dd8c2dfada8236 ]---
It's quite possible this is caused by the GPU seeing a stale page table entry, so perhaps coherency isn't working as well as it should...
Do you get an "Unhandled Page fault" message after this? It might give some clues. Coherency issues are a pain to debug though and it's of course possible there are issues on this specific platform.
Steve
On Mon, 5 Oct 2020 09:34:06 +0100 Steven Price steven.price@arm.com wrote:
On 05/10/2020 09:15, Boris Brezillon wrote:
Hi Robin, Neil,
On Wed, 16 Sep 2020 10:26:43 +0200 Neil Armstrong narmstrong@baylibre.com wrote:
Hi Robin,
On 16/09/2020 01:51, Robin Murphy wrote:
According to a downstream commit I found in the Khadas vendor kernel, the GPU on G12b is wired up for ACE-lite, so (now that Panfrost knows how to handle this properly) we should describe it as such. Otherwise the mismatch leads to all manner of fun with mismatched attributes and inadvertently snooping stale data from caches, which would account for at least some of the brokenness observed on this platform.
Signed-off-by: Robin Murphy robin.murphy@arm.com
arch/arm64/boot/dts/amlogic/meson-g12b.dtsi | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi b/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi index 9b8548e5f6e5..ee8fcae9f9f0 100644 --- a/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi +++ b/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi @@ -135,3 +135,7 @@ map1 { }; }; };
+&mali {
- dma-coherent;
+};
Thanks a lot for digging, I'll run a test to confirm it fixes the issue !
Sorry for the late reply. I triggered a dEQP run with this patch applied and I see a bunch of "panfrost ffe40000.gpu: matching BO is not heap type" errors (see below for a full backtrace). That doesn't seem to happen when we drop this dma-coherent property.
[ 690.945731] ------------[ cut here ]------------ [ 690.950003] panfrost ffe40000.gpu: matching BO is not heap type (GPU VA = 319a000) [ 690.950051] WARNING: CPU: 0 PID: 120 at drivers/gpu/drm/panfrost/panfrost_mmu.c:465 panfrost_mmu_irq_handler_thread+0x47c/0x650 [ 690.968854] Modules linked in: [ 690.971878] CPU: 0 PID: 120 Comm: irq/27-panfrost Tainted: G W 5.9.0-rc5-02434-g7d8109ec5a42 #784 [ 690.981964] Hardware name: Khadas VIM3 (DT) [ 690.986107] pstate: 60000005 (nZCv daif -PAN -UAO BTYPE=--) [ 690.991627] pc : panfrost_mmu_irq_handler_thread+0x47c/0x650 [ 690.997232] lr : panfrost_mmu_irq_handler_thread+0x47c/0x650 [ 691.002836] sp : ffff800011bcbcd0 [ 691.006114] x29: ffff800011bcbcf0 x28: ffff0000f3fe3800 [ 691.011375] x27: ffff0000ceaf5350 x26: ffff0000ca5fc500 [ 691.016636] x25: ffff0000f32409c0 x24: 0000000000000001 [ 691.021897] x23: ffff0000f3240880 x22: ffff0000f3e3a800 [ 691.027159] x21: 0000000000000000 x20: 0000000000000000 [ 691.032420] x19: 0000000000010001 x18: 0000000000000020 [ 691.037681] x17: 0000000000000000 x16: 0000000000000000 [ 691.042942] x15: ffff0000f3fe3c70 x14: ffffffffffffffff [ 691.048204] x13: ffff8000116c2428 x12: ffff8000116c2086 [ 691.053466] x11: ffff800011bcbcd0 x10: ffff800011bcbcd0 [ 691.058727] x9 : 00000000fffffffe x8 : 0000000000000000 [ 691.063988] x7 : 7420706165682074 x6 : ffff8000116c1816 [ 691.069249] x5 : 0000000000000000 x4 : 0000000000000000 [ 691.074510] x3 : 00000000ffffffff x2 : ffff8000e348c000 [ 691.079771] x1 : f1b91ff9af2df000 x0 : 0000000000000000 [ 691.085033] Call trace: [ 691.087452] panfrost_mmu_irq_handler_thread+0x47c/0x650 [ 691.092712] irq_thread_fn+0x2c/0xa0 [ 691.096246] irq_thread+0x184/0x208 [ 691.099699] kthread+0x140/0x160 [ 691.102890] ret_from_fork+0x10/0x34 [ 691.106424] ---[ end trace b5dd8c2dfada8236 ]---
It's quite possible this is caused by the GPU seeing a stale page table entry, so perhaps coherency isn't working as well as it should...
Do you get an "Unhandled Page fault" message after this?
Yep (see below).
--->8---
[ 689.640491] ------------[ cut here ]------------ [ 689.644754] panfrost ffe40000.gpu: matching BO is not heap type (GPU VA = 3146000) [ 689.644802] WARNING: CPU: 0 PID: 120 at drivers/gpu/drm/panfrost/panfrost_mmu.c:465 panfrost_mmu_irq_handler_thread+0x47c/0x650 [ 689.663607] Modules linked in: [ 689.666631] CPU: 0 PID: 120 Comm: irq/27-panfrost Tainted: G W 5.9.0-rc5-02434-g7d8109ec5a42 #784 [ 689.676717] Hardware name: Khadas VIM3 (DT) [ 689.680860] pstate: 60000005 (nZCv daif -PAN -UAO BTYPE=--) [ 689.686380] pc : panfrost_mmu_irq_handler_thread+0x47c/0x650 [ 689.691987] lr : panfrost_mmu_irq_handler_thread+0x47c/0x650 [ 689.697590] sp : ffff800011bcbcd0 [ 689.700867] x29: ffff800011bcbcf0 x28: ffff0000f3fe3800 [ 689.706128] x27: ffff0000d89cf750 x26: ffff0000da34a800 [ 689.711389] x25: ffff0000f32409c0 x24: 0000000000000001 [ 689.716650] x23: ffff0000f3240880 x22: ffff0000d456e000 [ 689.721911] x21: 0000000000000000 x20: 0000000000000000 [ 689.727173] x19: 0000000000010001 x18: 0000000000000020 [ 689.732434] x17: 0000000000000000 x16: 0000000000000000 [ 689.737695] x15: ffff0000f3fe3c70 x14: ffffffffffffffff [ 689.742957] x13: ffff8000116c2428 x12: ffff8000116c2086 [ 689.748218] x11: ffff800011bcbcd0 x10: ffff800011bcbcd0 [ 689.753479] x9 : 00000000fffffffe x8 : 0000000000000000 [ 689.758740] x7 : 7420706165682074 x6 : ffff8000116c1816 [ 689.764001] x5 : 0000000000000000 x4 : 0000000000000000 [ 689.769263] x3 : 00000000ffffffff x2 : ffff8000e348c000 [ 689.774524] x1 : f1b91ff9af2df000 x0 : 0000000000000000 [ 689.779786] Call trace: [ 689.782204] panfrost_mmu_irq_handler_thread+0x47c/0x650 [ 689.787465] irq_thread_fn+0x2c/0xa0 [ 689.790999] irq_thread+0x184/0x208 [ 689.794451] kthread+0x140/0x160 [ 689.797642] ret_from_fork+0x10/0x34 [ 689.801177] ---[ end trace b5dd8c2dfada8235 ]--- [ 689.805864] panfrost ffe40000.gpu: Unhandled Page fault in AS0 at VA 0x0000000003146080 [ 689.805864] Reason: TODO [ 689.805864] raw fault status: 0x10003C3 [ 689.805864] decoded fault status: SLAVE FAULT [ 689.805864] exception type 0xC3: TRANSLATION_FAULT_LEVEL3 [ 689.805864] access type 0x3: WRITE [ 689.805864] source id 0x100 [ 690.170419] panfrost ffe40000.gpu: gpu sched timeout, js=1, config=0x7300, status=0x8, head=0x3101100, tail=0x3101100, sched_job=000000004b442768 [ 690.770373] panfrost ffe40000.gpu: error powering up gpu shader [ 690.945123] panfrost ffe40000.gpu: error powering up gpu shader [ 690.945731] ------------[ cut here ]------------
On 05/10/2020 09:39, Boris Brezillon wrote:
On Mon, 5 Oct 2020 09:34:06 +0100 Steven Price steven.price@arm.com wrote:
On 05/10/2020 09:15, Boris Brezillon wrote:
Hi Robin, Neil,
On Wed, 16 Sep 2020 10:26:43 +0200 Neil Armstrong narmstrong@baylibre.com wrote:
Hi Robin,
On 16/09/2020 01:51, Robin Murphy wrote:
According to a downstream commit I found in the Khadas vendor kernel, the GPU on G12b is wired up for ACE-lite, so (now that Panfrost knows how to handle this properly) we should describe it as such. Otherwise the mismatch leads to all manner of fun with mismatched attributes and inadvertently snooping stale data from caches, which would account for at least some of the brokenness observed on this platform.
Signed-off-by: Robin Murphy robin.murphy@arm.com
arch/arm64/boot/dts/amlogic/meson-g12b.dtsi | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi b/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi index 9b8548e5f6e5..ee8fcae9f9f0 100644 --- a/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi +++ b/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi @@ -135,3 +135,7 @@ map1 { }; }; };
+&mali {
- dma-coherent;
+};
Thanks a lot for digging, I'll run a test to confirm it fixes the issue !
Sorry for the late reply. I triggered a dEQP run with this patch applied and I see a bunch of "panfrost ffe40000.gpu: matching BO is not heap type" errors (see below for a full backtrace). That doesn't seem to happen when we drop this dma-coherent property.
[ 690.945731] ------------[ cut here ]------------ [ 690.950003] panfrost ffe40000.gpu: matching BO is not heap type (GPU VA = 319a000) [ 690.950051] WARNING: CPU: 0 PID: 120 at drivers/gpu/drm/panfrost/panfrost_mmu.c:465 panfrost_mmu_irq_handler_thread+0x47c/0x650 [ 690.968854] Modules linked in: [ 690.971878] CPU: 0 PID: 120 Comm: irq/27-panfrost Tainted: G W 5.9.0-rc5-02434-g7d8109ec5a42 #784 [ 690.981964] Hardware name: Khadas VIM3 (DT) [ 690.986107] pstate: 60000005 (nZCv daif -PAN -UAO BTYPE=--) [ 690.991627] pc : panfrost_mmu_irq_handler_thread+0x47c/0x650 [ 690.997232] lr : panfrost_mmu_irq_handler_thread+0x47c/0x650 [ 691.002836] sp : ffff800011bcbcd0 [ 691.006114] x29: ffff800011bcbcf0 x28: ffff0000f3fe3800 [ 691.011375] x27: ffff0000ceaf5350 x26: ffff0000ca5fc500 [ 691.016636] x25: ffff0000f32409c0 x24: 0000000000000001 [ 691.021897] x23: ffff0000f3240880 x22: ffff0000f3e3a800 [ 691.027159] x21: 0000000000000000 x20: 0000000000000000 [ 691.032420] x19: 0000000000010001 x18: 0000000000000020 [ 691.037681] x17: 0000000000000000 x16: 0000000000000000 [ 691.042942] x15: ffff0000f3fe3c70 x14: ffffffffffffffff [ 691.048204] x13: ffff8000116c2428 x12: ffff8000116c2086 [ 691.053466] x11: ffff800011bcbcd0 x10: ffff800011bcbcd0 [ 691.058727] x9 : 00000000fffffffe x8 : 0000000000000000 [ 691.063988] x7 : 7420706165682074 x6 : ffff8000116c1816 [ 691.069249] x5 : 0000000000000000 x4 : 0000000000000000 [ 691.074510] x3 : 00000000ffffffff x2 : ffff8000e348c000 [ 691.079771] x1 : f1b91ff9af2df000 x0 : 0000000000000000 [ 691.085033] Call trace: [ 691.087452] panfrost_mmu_irq_handler_thread+0x47c/0x650 [ 691.092712] irq_thread_fn+0x2c/0xa0 [ 691.096246] irq_thread+0x184/0x208 [ 691.099699] kthread+0x140/0x160 [ 691.102890] ret_from_fork+0x10/0x34 [ 691.106424] ---[ end trace b5dd8c2dfada8236 ]---
It's quite possible this is caused by the GPU seeing a stale page table entry, so perhaps coherency isn't working as well as it should...
Do you get an "Unhandled Page fault" message after this?
Yep (see below).
--->8---
[...]
[ 689.805864] panfrost ffe40000.gpu: Unhandled Page fault in AS0 at VA 0x0000000003146080 [ 689.805864] Reason: TODO [ 689.805864] raw fault status: 0x10003C3 [ 689.805864] decoded fault status: SLAVE FAULT [ 689.805864] exception type 0xC3: TRANSLATION_FAULT_LEVEL3 [ 689.805864] access type 0x3: WRITE [ 689.805864] source id 0x100 [ 690.170419] panfrost ffe40000.gpu: gpu sched timeout, js=1, config=0x7300, status=0x8, head=0x3101100, tail=0x3101100, sched_job=000000004b442768 [ 690.770373] panfrost ffe40000.gpu: error powering up gpu shader [ 690.945123] panfrost ffe40000.gpu: error powering up gpu shader [ 690.945731] ------------[ cut here ]------------
That's a write fault from level 3 of the page table triggered by shader core 0 in a fragment job. So could be writing out the frame buffer.
It would be interesting to see if a patch like below would work round it:
----8<---- diff --git a/drivers/gpu/drm/panfrost/panfrost_mmu.c b/drivers/gpu/drm/panfrost/panfrost_mmu.c index e8f7b11352d2..5144860afdea 100644 --- a/drivers/gpu/drm/panfrost/panfrost_mmu.c +++ b/drivers/gpu/drm/panfrost/panfrost_mmu.c @@ -460,9 +460,12 @@ static int panfrost_mmu_map_fault_addr(struct panfrost_device *pfdev, int as,
bo = bomapping->obj; if (!bo->is_heap) { - dev_WARN(pfdev->dev, "matching BO is not heap type (GPU VA = %llx)", + dev_err(pfdev->dev, "matching BO is not heap type (GPU VA = %llx)", bomapping->mmnode.start << PAGE_SHIFT); - ret = -EINVAL; + /* Force a flush of the MMU to restart the transaction */ + mmu_hw_do_operation(pfdev, bomapping->mmu, addr, 0, + AS_COMMAND_FLUSH_MEM); + ret = 0; goto err_bo; } WARN_ON(bomapping->mmu->as != as); ---8<---
That's obviously not the correct solution (the fault shouldn't occur), but if flushing the GPU's caches and retrying works then we know it's a coherency issue. You can also try backing off from the big hammer of AS_COMMAND_FLUSH_MEM and trying AS_COMMAND_FLUSH_PT or AS_COMMAND_UNLOCK.
Steve
On 16/09/2020 01:51, Robin Murphy wrote:
According to a downstream commit I found in the Khadas vendor kernel, the GPU on G12b is wired up for ACE-lite, so (now that Panfrost knows how to handle this properly) we should describe it as such. Otherwise the mismatch leads to all manner of fun with mismatched attributes and inadvertently snooping stale data from caches, which would account for at least some of the brokenness observed on this platform.
Signed-off-by: Robin Murphy robin.murphy@arm.com
arch/arm64/boot/dts/amlogic/meson-g12b.dtsi | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi b/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi index 9b8548e5f6e5..ee8fcae9f9f0 100644 --- a/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi +++ b/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi @@ -135,3 +135,7 @@ map1 { }; }; };
+&mali {
- dma-coherent;
+};
Reviewed-by: Neil Armstrong narmstrong@baylibre.com
Hi Robin,
On 16/09/2020 01:51, Robin Murphy wrote:
Hi all,
I polished up my original proof-of-concept a little while back, but now that I've got my hands on my Juno again I've been able to actually test it to my satisfaction, so here are proper patches!
I tested on the Kkadas VIM3, and yes it fixes the random FAULTS I have *without*: [ 152.417127] panfrost ffe40000.gpu: gpu sched timeout, js=0, config=0x7300, status=0x58, head=0x3091400, tail=0x3091400, sched_job=000000004d83c2d7 [ 152.530928] panfrost ffe40000.gpu: js fault, js=1, status=INSTR_INVALID_ENC, head=0x30913c0, tail=0x30913c0 [ 152.539797] panfrost ffe40000.gpu: gpu sched timeout, js=1, config=0x7300, status=0x51, head=0x30913c0, tail=0x30913c0, sched_job=0000000038cecaf6 [ 156.943505] panfrost ffe40000.gpu: js fault, js=0, status=TILE_RANGE_FAULT, head=0x3091400, tail=0x3091400
but, with this patchset, I get the following fps with kmscube: Rendered 97 frames in 2.016291 sec (48.108145 fps) Rendered 206 frames in 4.016723 sec (51.285584 fps) Rendered 316 frames in 6.017208 sec (52.516052 fps) Rendered 430 frames in 8.017456 sec (53.632975 fps)
but when I resurrect my BROKEN_NS patchset (simply disabling shareability), I get: Rendered 120 frames in 2.000143 sec (59.995724 fps) Rendered 241 frames in 4.016760 sec (59.998605 fps) Rendered 362 frames in 6.033443 sec (59.998911 fps) Rendered 482 frames in 8.033531 sec (59.998522 fps)
So I get a performance regression with the dma-coherent approach, even if it's clearly the cleaner.
So: Tested-by: Neil Armstrong narmstrong@baylibre.com
Neil
It probably makes sense for patches #1 and #2 to stay together and both go via drm-misc, provided Will's OK with that.
Robin.
Robin Murphy (3): iommu/io-pgtable-arm: Support coherency for Mali LPAE drm/panfrost: Support cache-coherent integrations arm64: dts: meson: Describe G12b GPU as coherent
arch/arm64/boot/dts/amlogic/meson-g12b.dtsi | 4 ++++ drivers/gpu/drm/panfrost/panfrost_device.h | 1 + drivers/gpu/drm/panfrost/panfrost_drv.c | 2 ++ drivers/gpu/drm/panfrost/panfrost_gem.c | 2 ++ drivers/gpu/drm/panfrost/panfrost_mmu.c | 1 + drivers/iommu/io-pgtable-arm.c | 5 ++++- 6 files changed, 14 insertions(+), 1 deletion(-)
On Wed, Sep 16, 2020 at 11:04 AM Alyssa Rosenzweig alyssa.rosenzweig@collabora.com wrote:
So I get a performance regression with the dma-coherent approach, even if it's clearly the cleaner.
That's bizarre -- this should really be the faster of the two.
Coherency may not be free. CortexA9 had something like 4x slower memcpy if SMP was enabled as an example. I don't know if there's anything going on like that specifically here. If there's never any CPU accesses mixed in with kmscube, then there would be no benefit to coherency.
Rob
On 16/09/2020 18:46, Rob Herring wrote:
On Wed, Sep 16, 2020 at 11:04 AM Alyssa Rosenzweig alyssa.rosenzweig@collabora.com wrote:
So I get a performance regression with the dma-coherent approach, even if it's clearly the cleaner.
That's bizarre -- this should really be the faster of the two.
Coherency may not be free. CortexA9 had something like 4x slower memcpy if SMP was enabled as an example. I don't know if there's anything going on like that specifically here. If there's never any CPU accesses mixed in with kmscube, then there would be no benefit to coherency.
The DDK blob has the ability to mark only certain areas of memory as coherent for performance reasons. For simple things like kmscube I would expect that it's basically write-only from the CPU and almost all memory the GPU touches isn't touched by the CPU. I.e. coherency isn't helping and the coherency traffic is probably expensive. Whether the complexity is worth it for "real" content I don't know - it may just be silly benchmarks that benefit.
Steve
On 9/17/20 12:38 PM, Steven Price wrote:
On 16/09/2020 18:46, Rob Herring wrote:
On Wed, Sep 16, 2020 at 11:04 AM Alyssa Rosenzweig alyssa.rosenzweig@collabora.com wrote:
So I get a performance regression with the dma-coherent approach, even if it's clearly the cleaner.
That's bizarre -- this should really be the faster of the two.
Coherency may not be free. CortexA9 had something like 4x slower memcpy if SMP was enabled as an example. I don't know if there's anything going on like that specifically here. If there's never any CPU accesses mixed in with kmscube, then there would be no benefit to coherency.
The DDK blob has the ability to mark only certain areas of memory as coherent for performance reasons. For simple things like kmscube I would expect that it's basically write-only from the CPU and almost all memory the GPU touches isn't touched by the CPU. I.e. coherency isn't helping and the coherency traffic is probably expensive. Whether the complexity is worth it for "real" content I don't know - it may just be silly benchmarks that benefit.
Or maybe it's only a problem for applications that do silly things? I don't think kmscube was ever optimized for performance.
Regards,
Tomeu
On 17/09/2020 11:51, Tomeu Vizoso wrote:
On 9/17/20 12:38 PM, Steven Price wrote:
On 16/09/2020 18:46, Rob Herring wrote:
On Wed, Sep 16, 2020 at 11:04 AM Alyssa Rosenzweig alyssa.rosenzweig@collabora.com wrote:
So I get a performance regression with the dma-coherent approach, even if it's clearly the cleaner.
That's bizarre -- this should really be the faster of the two.
Coherency may not be free. CortexA9 had something like 4x slower memcpy if SMP was enabled as an example. I don't know if there's anything going on like that specifically here. If there's never any CPU accesses mixed in with kmscube, then there would be no benefit to coherency.
The DDK blob has the ability to mark only certain areas of memory as coherent for performance reasons. For simple things like kmscube I would expect that it's basically write-only from the CPU and almost all memory the GPU touches isn't touched by the CPU. I.e. coherency isn't helping and the coherency traffic is probably expensive. Whether the complexity is worth it for "real" content I don't know - it may just be silly benchmarks that benefit.
Or maybe it's only a problem for applications that do silly things? I don't think kmscube was ever optimized for performance.
Well doing silly things is almost the definition of a benchmark ;) A lot of the mobile graphics benchmarks suffer from not being very intelligent in how they render (e.g. many have meshes that are far too detailed so the triangles are smaller than the pixels).
Of course there are also applications that get things wrong too.
Steve
The DDK blob has the ability to mark only certain areas of memory as coherent for performance reasons. For simple things like kmscube I would expect that it's basically write-only from the CPU and almost all memory the GPU touches isn't touched by the CPU. I.e. coherency isn't helping and the coherency traffic is probably expensive. Whether the complexity is worth it for "real" content I don't know - it may just be silly benchmarks that benefit.
Right, Panfrost userspace specifically assumes GPU reads to be expensive and treats GPU memory as write-only *except* for a few special cases (compute-like workloads, glReadPixels, some blits, etc).
The vast majority of the GPU memory - everything used in kmscube - will be write-only to the CPU and fed directly into the display zero-copy (or read by the GPU later as a dmabuf).
On 2020-09-16 18:46, Rob Herring wrote:
On Wed, Sep 16, 2020 at 11:04 AM Alyssa Rosenzweig alyssa.rosenzweig@collabora.com wrote:
So I get a performance regression with the dma-coherent approach, even if it's clearly the cleaner.
That's bizarre -- this should really be the faster of the two.
Coherency may not be free. CortexA9 had something like 4x slower memcpy if SMP was enabled as an example. I don't know if there's anything going on like that specifically here. If there's never any CPU accesses mixed in with kmscube, then there would be no benefit to coherency.
There will still be CPU benefits in terms of not having to spend time cache-cleaning every BO upon allocation, and less overhead on writing out descriptors in the first place (due to cacheable vs. non-cacheable).
I haven't tried the NSh hack on Juno, but I don't see any notable performance issue as-is - kmscube hits a solid 60FPS from the off (now that it works without spewing faults). Given that the hardware on Juno can be generously described as "less good", it would certainly be interesting to figure out what difference is at play here...
The usual argument against I/O coherency is that it adds latency to every access, but if you already have a coherent interconnect anyway then the sensible answer to that is implementing decent snoop filters, rather than making software more complicated.
Robin.
dri-devel@lists.freedesktop.org