[PATCH 0/3] drm: panfrost: Coherency support

List overview All Threads
Download

newer

older

[PATCH] drm/gma500: fix double...

[PATCH 1/2] panfrost: Make sure...

Robin Murphy

15 Sep 2020 15 Sep '20

11:51 p.m.

Hi all,

I polished up my original proof-of-concept a little while back, but now that I've got my hands on my Juno again I've been able to actually test it to my satisfaction, so here are proper patches!

It probably makes sense for patches #1 and #2 to stay together and both go via drm-misc, provided Will's OK with that.

Robin.

Robin Murphy (3): iommu/io-pgtable-arm: Support coherency for Mali LPAE drm/panfrost: Support cache-coherent integrations arm64: dts: meson: Describe G12b GPU as coherent

arch/arm64/boot/dts/amlogic/meson-g12b.dtsi | 4 ++++ drivers/gpu/drm/panfrost/panfrost_device.h | 1 + drivers/gpu/drm/panfrost/panfrost_drv.c | 2 ++ drivers/gpu/drm/panfrost/panfrost_gem.c | 2 ++ drivers/gpu/drm/panfrost/panfrost_mmu.c | 1 + drivers/iommu/io-pgtable-arm.c | 5 ++++- 6 files changed, 14 insertions(+), 1 deletion(-)

-- 2.28.0.dirty

Show replies by date

Robin Murphy

15 Sep 15 Sep

11:51 p.m.

New subject: [PATCH 1/3] iommu/io-pgtable-arm: Support coherency for Mali LPAE

Midgard GPUs have ACE-Lite master interfaces which allows systems to integrate them in an I/O-coherent manner. It seems that from the GPU's viewpoint, the rest of the system is its outer shareable domain, and so even when snoop signals are wired up, they are only emitted for outer shareable accesses. As such, setting the TTBR_SHARE_OUTER bit does indeed get coherent pagetable walks working nicely for the coherent T620 in the Arm Juno SoC.

Reviewed-by: Steven Price steven.price@arm.com Signed-off-by: Robin Murphy robin.murphy@arm.com --- drivers/iommu/io-pgtable-arm.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c index dc7bcf858b6d..e47012006dcc 100644 --- a/drivers/iommu/io-pgtable-arm.c +++ b/drivers/iommu/io-pgtable-arm.c @@ -440,7 +440,7 @@ static arm_lpae_iopte arm_lpae_prot_to_pte(struct arm_lpae_io_pgtable *data, << ARM_LPAE_PTE_ATTRINDX_SHIFT); }

- if (prot & IOMMU_CACHE) + if (prot & IOMMU_CACHE && data->iop.fmt != ARM_MALI_LPAE) pte |= ARM_LPAE_PTE_SH_IS; else pte |= ARM_LPAE_PTE_SH_OS; @@ -1049,6 +1049,9 @@ arm_mali_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg, void *cookie) cfg->arm_mali_lpae_cfg.transtab = virt_to_phys(data->pgd) | ARM_MALI_LPAE_TTBR_READ_INNER | ARM_MALI_LPAE_TTBR_ADRMODE_TABLE; + if (cfg->coherent_walk) + cfg->arm_mali_lpae_cfg.transtab |= ARM_MALI_LPAE_TTBR_SHARE_OUTER; + return &data->iop;

out_free_data:

-- 2.28.0.dirty

Will Deacon

21 Sep 21 Sep

5:57 p.m.

New subject: [PATCH 1/3] iommu/io-pgtable-arm: Support coherency for Mali LPAE

On Wed, Sep 16, 2020 at 12:51:05AM +0100, Robin Murphy wrote:

...

Midgard GPUs have ACE-Lite master interfaces which allows systems to integrate them in an I/O-coherent manner. It seems that from the GPU's viewpoint, the rest of the system is its outer shareable domain, and so even when snoop signals are wired up, they are only emitted for outer shareable accesses. As such, setting the TTBR_SHARE_OUTER bit does indeed get coherent pagetable walks working nicely for the coherent T620 in the Arm Juno SoC.

I can't help but think some of this commentary deserves to be in the code as well.

Do you know if this sort of thing is done for other SoCs too, or is this just a Juno quirk?

Will

Robin Murphy

9:53 p.m.

New subject: [PATCH 1/3] iommu/io-pgtable-arm: Support coherency for Mali LPAE

On 2020-09-21 18:57, Will Deacon wrote:

...

On Wed, Sep 16, 2020 at 12:51:05AM +0100, Robin Murphy wrote:

...
Midgard GPUs have ACE-Lite master interfaces which allows systems to integrate them in an I/O-coherent manner. It seems that from the GPU's viewpoint, the rest of the system is its outer shareable domain, and so even when snoop signals are wired up, they are only emitted for outer shareable accesses. As such, setting the TTBR_SHARE_OUTER bit does indeed get coherent pagetable walks working nicely for the coherent T620 in the Arm Juno SoC.

I can't help but think some of this commentary deserves to be in the code as well.

Sure, if you want.

...

Do you know if this sort of thing is done for other SoCs too, or is this just a Juno quirk?

Yup, this is a "Midgard working as designed" thing. Juno is the coherent example I have to hand, but off the top of my head I believe some of the Exynos SoCs can also use their GPUs coherently if a switch is flipped in the interconnect to change routing between the CCI and a direct-to-RAM path; I expect there are probably further Midgard examples that I'm not aware of. Then there are definitely coherent Bifrost GPUs like the Amlogic S922/A311 that prompted me to revive this patch, which we currently drive in "Legacy" mode and thus behave the same way as Midgard (Bifrost's "AArch64" mode realigns Ish and Osh with the rest of the system, and instead invents a new "Internal Shareable" value in between Nsh and Ish to represent the shareability between cores within the GPU for which Midgard hijacked Ish).

Robin.

Will Deacon

10:24 p.m.

New subject: [PATCH 1/3] iommu/io-pgtable-arm: Support coherency for Mali LPAE

On Mon, Sep 21, 2020 at 10:53:23PM +0100, Robin Murphy wrote:

...

On 2020-09-21 18:57, Will Deacon wrote:

...
On Wed, Sep 16, 2020 at 12:51:05AM +0100, Robin Murphy wrote:

...
Midgard GPUs have ACE-Lite master interfaces which allows systems to integrate them in an I/O-coherent manner. It seems that from the GPU's viewpoint, the rest of the system is its outer shareable domain, and so even when snoop signals are wired up, they are only emitted for outer shareable accesses. As such, setting the TTBR_SHARE_OUTER bit does indeed get coherent pagetable walks working nicely for the coherent T620 in the Arm Juno SoC.

I can't help but think some of this commentary deserves to be in the code as well.

Sure, if you want.

Yes, please.

...

...
Do you know if this sort of thing is done for other SoCs too, or is this just a Juno quirk?

Yup, this is a "Midgard working as designed" thing. Juno is the coherent example I have to hand, but off the top of my head I believe some of the Exynos SoCs can also use their GPUs coherently if a switch is flipped in the interconnect to change routing between the CCI and a direct-to-RAM path; I expect there are probably further Midgard examples that I'm not aware of. Then there are definitely coherent Bifrost GPUs like the Amlogic S922/A311 that prompted me to revive this patch, which we currently drive in "Legacy" mode and thus behave the same way as Midgard (Bifrost's "AArch64" mode realigns Ish and Osh with the rest of the system, and instead invents a new "Internal Shareable" value in between Nsh and Ish to represent the shareability between cores within the GPU for which Midgard hijacked Ish).

That is more than I wanted to know :) "Internal Shareable", jeez...

Thanks,

Will

Robin Murphy

15 Sep 15 Sep

11:51 p.m.

New subject: [PATCH 2/3] drm/panfrost: Support cache-coherent integrations

When the GPU's ACE-Lite interface is fully wired up and capable of snooping CPU caches, it may be described as "dma-coherent" in devicetree, which will already inform the DMA layer not to perform unnecessary cache maintenance. However, we still need to ensure that the GPU uses the appropriate cacheable outer-shareable attributes in order to generate the requisite snoop signals, and that CPU mappings don't create a mismatch by using a non-cacheable type either.

Signed-off-by: Robin Murphy robin.murphy@arm.com --- drivers/gpu/drm/panfrost/panfrost_device.h | 1 + drivers/gpu/drm/panfrost/panfrost_drv.c | 2 ++ drivers/gpu/drm/panfrost/panfrost_gem.c | 2 ++ drivers/gpu/drm/panfrost/panfrost_mmu.c | 1 + 4 files changed, 6 insertions(+)

diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h b/drivers/gpu/drm/panfrost/panfrost_device.h index c30c719a8059..b31f45315e96 100644 --- a/drivers/gpu/drm/panfrost/panfrost_device.h +++ b/drivers/gpu/drm/panfrost/panfrost_device.h @@ -84,6 +84,7 @@ struct panfrost_device { /* pm_domains for devices with more than one. */ struct device *pm_domain_devs[MAX_PM_DOMAINS]; struct device_link *pm_domain_links[MAX_PM_DOMAINS]; + bool coherent;

struct panfrost_features features; const struct panfrost_compatible *comp; diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c b/drivers/gpu/drm/panfrost/panfrost_drv.c index ada51df9a7a3..2a6f2f716b2f 100644 --- a/drivers/gpu/drm/panfrost/panfrost_drv.c +++ b/drivers/gpu/drm/panfrost/panfrost_drv.c @@ -588,6 +588,8 @@ static int panfrost_probe(struct platform_device *pdev) if (!pfdev->comp) return -ENODEV;

+ pfdev->coherent = device_get_dma_attr(&pdev->dev) == DEV_DMA_COHERENT; + /* Allocate and initialze the DRM device. */ ddev = drm_dev_alloc(&panfrost_drm_driver, &pdev->dev); if (IS_ERR(ddev)) diff --git a/drivers/gpu/drm/panfrost/panfrost_gem.c b/drivers/gpu/drm/panfrost/panfrost_gem.c index 33355dd302f1..cdf1a8754eba 100644 --- a/drivers/gpu/drm/panfrost/panfrost_gem.c +++ b/drivers/gpu/drm/panfrost/panfrost_gem.c @@ -220,6 +220,7 @@ static const struct drm_gem_object_funcs panfrost_gem_funcs = { */ struct drm_gem_object *panfrost_gem_create_object(struct drm_device *dev, size_t size) { + struct panfrost_device *pfdev = dev->dev_private; struct panfrost_gem_object *obj;

obj = kzalloc(sizeof(*obj), GFP_KERNEL); @@ -229,6 +230,7 @@ struct drm_gem_object *panfrost_gem_create_object(struct drm_device *dev, size_t INIT_LIST_HEAD(&obj->mappings.list); mutex_init(&obj->mappings.lock); obj->base.base.funcs = &panfrost_gem_funcs; + obj->base.map_cached = pfdev->coherent;

return &obj->base.base; } diff --git a/drivers/gpu/drm/panfrost/panfrost_mmu.c b/drivers/gpu/drm/panfrost/panfrost_mmu.c index e8f7b11352d2..8852fd378f7a 100644 --- a/drivers/gpu/drm/panfrost/panfrost_mmu.c +++ b/drivers/gpu/drm/panfrost/panfrost_mmu.c @@ -371,6 +371,7 @@ int panfrost_mmu_pgtable_alloc(struct panfrost_file_priv *priv) .pgsize_bitmap = SZ_4K | SZ_2M, .ias = FIELD_GET(0xff, pfdev->features.mmu_features), .oas = FIELD_GET(0xff00, pfdev->features.mmu_features), + .coherent_walk = pfdev->coherent, .tlb = &mmu_tlb_ops, .iommu_dev = pfdev->dev, };

-- 2.28.0.dirty

Steven Price

17 Sep 17 Sep

10:37 a.m.

New subject: [PATCH 2/3] drm/panfrost: Support cache-coherent integrations

On 16/09/2020 00:51, Robin Murphy wrote:

...

When the GPU's ACE-Lite interface is fully wired up and capable of snooping CPU caches, it may be described as "dma-coherent" in devicetree, which will already inform the DMA layer not to perform unnecessary cache maintenance. However, we still need to ensure that the GPU uses the appropriate cacheable outer-shareable attributes in order to generate the requisite snoop signals, and that CPU mappings don't create a mismatch by using a non-cacheable type either.

Signed-off-by: Robin Murphy robin.murphy@arm.com

LGTM

Reviewed-by: Steven Price steven.price@arm.com

...

drivers/gpu/drm/panfrost/panfrost_device.h | 1 + drivers/gpu/drm/panfrost/panfrost_drv.c | 2 ++ drivers/gpu/drm/panfrost/panfrost_gem.c | 2 ++ drivers/gpu/drm/panfrost/panfrost_mmu.c | 1 + 4 files changed, 6 insertions(+)

diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h b/drivers/gpu/drm/panfrost/panfrost_device.h index c30c719a8059..b31f45315e96 100644 --- a/drivers/gpu/drm/panfrost/panfrost_device.h +++ b/drivers/gpu/drm/panfrost/panfrost_device.h @@ -84,6 +84,7 @@ struct panfrost_device { /* pm_domains for devices with more than one. */ struct device *pm_domain_devs[MAX_PM_DOMAINS]; struct device_link *pm_domain_links[MAX_PM_DOMAINS];

bool coherent;

struct panfrost_features features; const struct panfrost_compatible *comp;

diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c b/drivers/gpu/drm/panfrost/panfrost_drv.c index ada51df9a7a3..2a6f2f716b2f 100644 --- a/drivers/gpu/drm/panfrost/panfrost_drv.c +++ b/drivers/gpu/drm/panfrost/panfrost_drv.c @@ -588,6 +588,8 @@ static int panfrost_probe(struct platform_device *pdev) if (!pfdev->comp) return -ENODEV;

pfdev->coherent = device_get_dma_attr(&pdev->dev) == DEV_DMA_COHERENT;

/* Allocate and initialze the DRM device. */ ddev = drm_dev_alloc(&panfrost_drm_driver, &pdev->dev); if (IS_ERR(ddev))

diff --git a/drivers/gpu/drm/panfrost/panfrost_gem.c b/drivers/gpu/drm/panfrost/panfrost_gem.c index 33355dd302f1..cdf1a8754eba 100644 --- a/drivers/gpu/drm/panfrost/panfrost_gem.c +++ b/drivers/gpu/drm/panfrost/panfrost_gem.c @@ -220,6 +220,7 @@ static const struct drm_gem_object_funcs panfrost_gem_funcs = { */ struct drm_gem_object *panfrost_gem_create_object(struct drm_device *dev, size_t size) {

struct panfrost_device *pfdev = dev->dev_private; struct panfrost_gem_object *obj;

obj = kzalloc(sizeof(*obj), GFP_KERNEL);

@@ -229,6 +230,7 @@ struct drm_gem_object *panfrost_gem_create_object(struct drm_device *dev, size_t INIT_LIST_HEAD(&obj->mappings.list); mutex_init(&obj->mappings.lock); obj->base.base.funcs = &panfrost_gem_funcs;

obj->base.map_cached = pfdev->coherent;

return &obj->base.base; }

diff --git a/drivers/gpu/drm/panfrost/panfrost_mmu.c b/drivers/gpu/drm/panfrost/panfrost_mmu.c index e8f7b11352d2..8852fd378f7a 100644 --- a/drivers/gpu/drm/panfrost/panfrost_mmu.c +++ b/drivers/gpu/drm/panfrost/panfrost_mmu.c @@ -371,6 +371,7 @@ int panfrost_mmu_pgtable_alloc(struct panfrost_file_priv *priv) .pgsize_bitmap = SZ_4K | SZ_2M, .ias = FIELD_GET(0xff, pfdev->features.mmu_features), .oas = FIELD_GET(0xff00, pfdev->features.mmu_features),
.coherent_walk	= pfdev->coherent,
.tlb = &mmu_tlb_ops, .iommu_dev = pfdev->dev, };

Robin Murphy

15 Sep 15 Sep

11:51 p.m.

New subject: [PATCH 3/3] arm64: dts: meson: Describe G12b GPU as coherent

According to a downstream commit I found in the Khadas vendor kernel, the GPU on G12b is wired up for ACE-lite, so (now that Panfrost knows how to handle this properly) we should describe it as such. Otherwise the mismatch leads to all manner of fun with mismatched attributes and inadvertently snooping stale data from caches, which would account for at least some of the brokenness observed on this platform.

Signed-off-by: Robin Murphy robin.murphy@arm.com --- arch/arm64/boot/dts/amlogic/meson-g12b.dtsi | 4 ++++ 1 file changed, 4 insertions(+)

diff --git a/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi b/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi index 9b8548e5f6e5..ee8fcae9f9f0 100644 --- a/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi +++ b/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi @@ -135,3 +135,7 @@ map1 { }; }; }; + +&mali { + dma-coherent; +};

-- 2.28.0.dirty

Neil Armstrong

16 Sep 16 Sep

8:26 a.m.

New subject: [PATCH 3/3] arm64: dts: meson: Describe G12b GPU as coherent

Hi Robin,

On 16/09/2020 01:51, Robin Murphy wrote:

...

According to a downstream commit I found in the Khadas vendor kernel, the GPU on G12b is wired up for ACE-lite, so (now that Panfrost knows how to handle this properly) we should describe it as such. Otherwise the mismatch leads to all manner of fun with mismatched attributes and inadvertently snooping stale data from caches, which would account for at least some of the brokenness observed on this platform.

Signed-off-by: Robin Murphy robin.murphy@arm.com

arch/arm64/boot/dts/amlogic/meson-g12b.dtsi | 4 ++++ 1 file changed, 4 insertions(+)

diff --git a/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi b/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi index 9b8548e5f6e5..ee8fcae9f9f0 100644 --- a/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi +++ b/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi @@ -135,3 +135,7 @@ map1 { }; }; };

+&mali {

dma-coherent;

+};

Thanks a lot for digging, I'll run a test to confirm it fixes the issue !

Neil

Boris Brezillon

5 Oct 5 Oct

8:15 a.m.

New subject: [PATCH 3/3] arm64: dts: meson: Describe G12b GPU as coherent

Hi Robin, Neil,

On Wed, 16 Sep 2020 10:26:43 +0200 Neil Armstrong narmstrong@baylibre.com wrote:

...

Hi Robin,

On 16/09/2020 01:51, Robin Murphy wrote:

...
According to a downstream commit I found in the Khadas vendor kernel, the GPU on G12b is wired up for ACE-lite, so (now that Panfrost knows how to handle this properly) we should describe it as such. Otherwise the mismatch leads to all manner of fun with mismatched attributes and inadvertently snooping stale data from caches, which would account for at least some of the brokenness observed on this platform.

Signed-off-by: Robin Murphy robin.murphy@arm.com

arch/arm64/boot/dts/amlogic/meson-g12b.dtsi | 4 ++++ 1 file changed, 4 insertions(+)

diff --git a/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi b/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi index 9b8548e5f6e5..ee8fcae9f9f0 100644 --- a/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi +++ b/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi @@ -135,3 +135,7 @@ map1 { }; }; };

+&mali {

dma-coherent;

+};

Thanks a lot for digging, I'll run a test to confirm it fixes the issue !

Sorry for the late reply. I triggered a dEQP run with this patch applied and I see a bunch of "panfrost ffe40000.gpu: matching BO is not heap type" errors (see below for a full backtrace). That doesn't seem to happen when we drop this dma-coherent property.

[ 690.945731] ------------[ cut here ]------------ [ 690.950003] panfrost ffe40000.gpu: matching BO is not heap type (GPU VA = 319a000) [ 690.950051] WARNING: CPU: 0 PID: 120 at drivers/gpu/drm/panfrost/panfrost_mmu.c:465 panfrost_mmu_irq_handler_thread+0x47c/0x650 [ 690.968854] Modules linked in: [ 690.971878] CPU: 0 PID: 120 Comm: irq/27-panfrost Tainted: G W 5.9.0-rc5-02434-g7d8109ec5a42 #784 [ 690.981964] Hardware name: Khadas VIM3 (DT) [ 690.986107] pstate: 60000005 (nZCv daif -PAN -UAO BTYPE=--) [ 690.991627] pc : panfrost_mmu_irq_handler_thread+0x47c/0x650 [ 690.997232] lr : panfrost_mmu_irq_handler_thread+0x47c/0x650 [ 691.002836] sp : ffff800011bcbcd0 [ 691.006114] x29: ffff800011bcbcf0 x28: ffff0000f3fe3800 [ 691.011375] x27: ffff0000ceaf5350 x26: ffff0000ca5fc500 [ 691.016636] x25: ffff0000f32409c0 x24: 0000000000000001 [ 691.021897] x23: ffff0000f3240880 x22: ffff0000f3e3a800 [ 691.027159] x21: 0000000000000000 x20: 0000000000000000 [ 691.032420] x19: 0000000000010001 x18: 0000000000000020 [ 691.037681] x17: 0000000000000000 x16: 0000000000000000 [ 691.042942] x15: ffff0000f3fe3c70 x14: ffffffffffffffff [ 691.048204] x13: ffff8000116c2428 x12: ffff8000116c2086 [ 691.053466] x11: ffff800011bcbcd0 x10: ffff800011bcbcd0 [ 691.058727] x9 : 00000000fffffffe x8 : 0000000000000000 [ 691.063988] x7 : 7420706165682074 x6 : ffff8000116c1816 [ 691.069249] x5 : 0000000000000000 x4 : 0000000000000000 [ 691.074510] x3 : 00000000ffffffff x2 : ffff8000e348c000 [ 691.079771] x1 : f1b91ff9af2df000 x0 : 0000000000000000 [ 691.085033] Call trace: [ 691.087452] panfrost_mmu_irq_handler_thread+0x47c/0x650 [ 691.092712] irq_thread_fn+0x2c/0xa0 [ 691.096246] irq_thread+0x184/0x208 [ 691.099699] kthread+0x140/0x160 [ 691.102890] ret_from_fork+0x10/0x34 [ 691.106424] ---[ end trace b5dd8c2dfada8236 ]---

Steven Price

8:34 a.m.

New subject: [PATCH 3/3] arm64: dts: meson: Describe G12b GPU as coherent

On 05/10/2020 09:15, Boris Brezillon wrote:

...

Hi Robin, Neil,

On Wed, 16 Sep 2020 10:26:43 +0200 Neil Armstrong narmstrong@baylibre.com wrote:

...
Hi Robin,

On 16/09/2020 01:51, Robin Murphy wrote:

...
According to a downstream commit I found in the Khadas vendor kernel, the GPU on G12b is wired up for ACE-lite, so (now that Panfrost knows how to handle this properly) we should describe it as such. Otherwise the mismatch leads to all manner of fun with mismatched attributes and inadvertently snooping stale data from caches, which would account for at least some of the brokenness observed on this platform.

Signed-off-by: Robin Murphy robin.murphy@arm.com

arch/arm64/boot/dts/amlogic/meson-g12b.dtsi | 4 ++++ 1 file changed, 4 insertions(+)

diff --git a/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi b/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi index 9b8548e5f6e5..ee8fcae9f9f0 100644 --- a/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi +++ b/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi @@ -135,3 +135,7 @@ map1 { }; }; };

+&mali {

dma-coherent;

+};

Thanks a lot for digging, I'll run a test to confirm it fixes the issue !

Sorry for the late reply. I triggered a dEQP run with this patch applied and I see a bunch of "panfrost ffe40000.gpu: matching BO is not heap type" errors (see below for a full backtrace). That doesn't seem to happen when we drop this dma-coherent property.

[ 690.945731] ------------[ cut here ]------------ [ 690.950003] panfrost ffe40000.gpu: matching BO is not heap type (GPU VA = 319a000) [ 690.950051] WARNING: CPU: 0 PID: 120 at drivers/gpu/drm/panfrost/panfrost_mmu.c:465 panfrost_mmu_irq_handler_thread+0x47c/0x650 [ 690.968854] Modules linked in: [ 690.971878] CPU: 0 PID: 120 Comm: irq/27-panfrost Tainted: G W 5.9.0-rc5-02434-g7d8109ec5a42 #784 [ 690.981964] Hardware name: Khadas VIM3 (DT) [ 690.986107] pstate: 60000005 (nZCv daif -PAN -UAO BTYPE=--) [ 690.991627] pc : panfrost_mmu_irq_handler_thread+0x47c/0x650 [ 690.997232] lr : panfrost_mmu_irq_handler_thread+0x47c/0x650 [ 691.002836] sp : ffff800011bcbcd0 [ 691.006114] x29: ffff800011bcbcf0 x28: ffff0000f3fe3800 [ 691.011375] x27: ffff0000ceaf5350 x26: ffff0000ca5fc500 [ 691.016636] x25: ffff0000f32409c0 x24: 0000000000000001 [ 691.021897] x23: ffff0000f3240880 x22: ffff0000f3e3a800 [ 691.027159] x21: 0000000000000000 x20: 0000000000000000 [ 691.032420] x19: 0000000000010001 x18: 0000000000000020 [ 691.037681] x17: 0000000000000000 x16: 0000000000000000 [ 691.042942] x15: ffff0000f3fe3c70 x14: ffffffffffffffff [ 691.048204] x13: ffff8000116c2428 x12: ffff8000116c2086 [ 691.053466] x11: ffff800011bcbcd0 x10: ffff800011bcbcd0 [ 691.058727] x9 : 00000000fffffffe x8 : 0000000000000000 [ 691.063988] x7 : 7420706165682074 x6 : ffff8000116c1816 [ 691.069249] x5 : 0000000000000000 x4 : 0000000000000000 [ 691.074510] x3 : 00000000ffffffff x2 : ffff8000e348c000 [ 691.079771] x1 : f1b91ff9af2df000 x0 : 0000000000000000 [ 691.085033] Call trace: [ 691.087452] panfrost_mmu_irq_handler_thread+0x47c/0x650 [ 691.092712] irq_thread_fn+0x2c/0xa0 [ 691.096246] irq_thread+0x184/0x208 [ 691.099699] kthread+0x140/0x160 [ 691.102890] ret_from_fork+0x10/0x34 [ 691.106424] ---[ end trace b5dd8c2dfada8236 ]---

It's quite possible this is caused by the GPU seeing a stale page table entry, so perhaps coherency isn't working as well as it should...

Do you get an "Unhandled Page fault" message after this? It might give some clues. Coherency issues are a pain to debug though and it's of course possible there are issues on this specific platform.

Steve

Boris Brezillon

8:39 a.m.

New subject: [PATCH 3/3] arm64: dts: meson: Describe G12b GPU as coherent

On Mon, 5 Oct 2020 09:34:06 +0100 Steven Price steven.price@arm.com wrote:

...

On 05/10/2020 09:15, Boris Brezillon wrote:

...
Hi Robin, Neil,

On Wed, 16 Sep 2020 10:26:43 +0200 Neil Armstrong narmstrong@baylibre.com wrote:

...
Hi Robin,

On 16/09/2020 01:51, Robin Murphy wrote:

...
According to a downstream commit I found in the Khadas vendor kernel, the GPU on G12b is wired up for ACE-lite, so (now that Panfrost knows how to handle this properly) we should describe it as such. Otherwise the mismatch leads to all manner of fun with mismatched attributes and inadvertently snooping stale data from caches, which would account for at least some of the brokenness observed on this platform.

Signed-off-by: Robin Murphy robin.murphy@arm.com

arch/arm64/boot/dts/amlogic/meson-g12b.dtsi | 4 ++++ 1 file changed, 4 insertions(+)

diff --git a/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi b/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi index 9b8548e5f6e5..ee8fcae9f9f0 100644 --- a/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi +++ b/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi @@ -135,3 +135,7 @@ map1 { }; }; };

+&mali {

dma-coherent;

+};

Thanks a lot for digging, I'll run a test to confirm it fixes the issue !

Sorry for the late reply. I triggered a dEQP run with this patch applied and I see a bunch of "panfrost ffe40000.gpu: matching BO is not heap type" errors (see below for a full backtrace). That doesn't seem to happen when we drop this dma-coherent property.

[ 690.945731] ------------[ cut here ]------------ [ 690.950003] panfrost ffe40000.gpu: matching BO is not heap type (GPU VA = 319a000) [ 690.950051] WARNING: CPU: 0 PID: 120 at drivers/gpu/drm/panfrost/panfrost_mmu.c:465 panfrost_mmu_irq_handler_thread+0x47c/0x650 [ 690.968854] Modules linked in: [ 690.971878] CPU: 0 PID: 120 Comm: irq/27-panfrost Tainted: G W 5.9.0-rc5-02434-g7d8109ec5a42 #784 [ 690.981964] Hardware name: Khadas VIM3 (DT) [ 690.986107] pstate: 60000005 (nZCv daif -PAN -UAO BTYPE=--) [ 690.991627] pc : panfrost_mmu_irq_handler_thread+0x47c/0x650 [ 690.997232] lr : panfrost_mmu_irq_handler_thread+0x47c/0x650 [ 691.002836] sp : ffff800011bcbcd0 [ 691.006114] x29: ffff800011bcbcf0 x28: ffff0000f3fe3800 [ 691.011375] x27: ffff0000ceaf5350 x26: ffff0000ca5fc500 [ 691.016636] x25: ffff0000f32409c0 x24: 0000000000000001 [ 691.021897] x23: ffff0000f3240880 x22: ffff0000f3e3a800 [ 691.027159] x21: 0000000000000000 x20: 0000000000000000 [ 691.032420] x19: 0000000000010001 x18: 0000000000000020 [ 691.037681] x17: 0000000000000000 x16: 0000000000000000 [ 691.042942] x15: ffff0000f3fe3c70 x14: ffffffffffffffff [ 691.048204] x13: ffff8000116c2428 x12: ffff8000116c2086 [ 691.053466] x11: ffff800011bcbcd0 x10: ffff800011bcbcd0 [ 691.058727] x9 : 00000000fffffffe x8 : 0000000000000000 [ 691.063988] x7 : 7420706165682074 x6 : ffff8000116c1816 [ 691.069249] x5 : 0000000000000000 x4 : 0000000000000000 [ 691.074510] x3 : 00000000ffffffff x2 : ffff8000e348c000 [ 691.079771] x1 : f1b91ff9af2df000 x0 : 0000000000000000 [ 691.085033] Call trace: [ 691.087452] panfrost_mmu_irq_handler_thread+0x47c/0x650 [ 691.092712] irq_thread_fn+0x2c/0xa0 [ 691.096246] irq_thread+0x184/0x208 [ 691.099699] kthread+0x140/0x160 [ 691.102890] ret_from_fork+0x10/0x34 [ 691.106424] ---[ end trace b5dd8c2dfada8236 ]---

It's quite possible this is caused by the GPU seeing a stale page table entry, so perhaps coherency isn't working as well as it should...

Do you get an "Unhandled Page fault" message after this?

Yep (see below).

--->8---

[ 689.640491] ------------[ cut here ]------------ [ 689.644754] panfrost ffe40000.gpu: matching BO is not heap type (GPU VA = 3146000) [ 689.644802] WARNING: CPU: 0 PID: 120 at drivers/gpu/drm/panfrost/panfrost_mmu.c:465 panfrost_mmu_irq_handler_thread+0x47c/0x650 [ 689.663607] Modules linked in: [ 689.666631] CPU: 0 PID: 120 Comm: irq/27-panfrost Tainted: G W 5.9.0-rc5-02434-g7d8109ec5a42 #784 [ 689.676717] Hardware name: Khadas VIM3 (DT) [ 689.680860] pstate: 60000005 (nZCv daif -PAN -UAO BTYPE=--) [ 689.686380] pc : panfrost_mmu_irq_handler_thread+0x47c/0x650 [ 689.691987] lr : panfrost_mmu_irq_handler_thread+0x47c/0x650 [ 689.697590] sp : ffff800011bcbcd0 [ 689.700867] x29: ffff800011bcbcf0 x28: ffff0000f3fe3800 [ 689.706128] x27: ffff0000d89cf750 x26: ffff0000da34a800 [ 689.711389] x25: ffff0000f32409c0 x24: 0000000000000001 [ 689.716650] x23: ffff0000f3240880 x22: ffff0000d456e000 [ 689.721911] x21: 0000000000000000 x20: 0000000000000000 [ 689.727173] x19: 0000000000010001 x18: 0000000000000020 [ 689.732434] x17: 0000000000000000 x16: 0000000000000000 [ 689.737695] x15: ffff0000f3fe3c70 x14: ffffffffffffffff [ 689.742957] x13: ffff8000116c2428 x12: ffff8000116c2086 [ 689.748218] x11: ffff800011bcbcd0 x10: ffff800011bcbcd0 [ 689.753479] x9 : 00000000fffffffe x8 : 0000000000000000 [ 689.758740] x7 : 7420706165682074 x6 : ffff8000116c1816 [ 689.764001] x5 : 0000000000000000 x4 : 0000000000000000 [ 689.769263] x3 : 00000000ffffffff x2 : ffff8000e348c000 [ 689.774524] x1 : f1b91ff9af2df000 x0 : 0000000000000000 [ 689.779786] Call trace: [ 689.782204] panfrost_mmu_irq_handler_thread+0x47c/0x650 [ 689.787465] irq_thread_fn+0x2c/0xa0 [ 689.790999] irq_thread+0x184/0x208 [ 689.794451] kthread+0x140/0x160 [ 689.797642] ret_from_fork+0x10/0x34 [ 689.801177] ---[ end trace b5dd8c2dfada8235 ]--- [ 689.805864] panfrost ffe40000.gpu: Unhandled Page fault in AS0 at VA 0x0000000003146080 [ 689.805864] Reason: TODO [ 689.805864] raw fault status: 0x10003C3 [ 689.805864] decoded fault status: SLAVE FAULT [ 689.805864] exception type 0xC3: TRANSLATION_FAULT_LEVEL3 [ 689.805864] access type 0x3: WRITE [ 689.805864] source id 0x100 [ 690.170419] panfrost ffe40000.gpu: gpu sched timeout, js=1, config=0x7300, status=0x8, head=0x3101100, tail=0x3101100, sched_job=000000004b442768 [ 690.770373] panfrost ffe40000.gpu: error powering up gpu shader [ 690.945123] panfrost ffe40000.gpu: error powering up gpu shader [ 690.945731] ------------[ cut here ]------------

Steven Price

10:05 a.m.

New subject: [PATCH 3/3] arm64: dts: meson: Describe G12b GPU as coherent

On 05/10/2020 09:39, Boris Brezillon wrote:

...

On Mon, 5 Oct 2020 09:34:06 +0100 Steven Price steven.price@arm.com wrote:

...
On 05/10/2020 09:15, Boris Brezillon wrote:

...
Hi Robin, Neil,

On Wed, 16 Sep 2020 10:26:43 +0200 Neil Armstrong narmstrong@baylibre.com wrote:

...
Hi Robin,

On 16/09/2020 01:51, Robin Murphy wrote:

...
According to a downstream commit I found in the Khadas vendor kernel, the GPU on G12b is wired up for ACE-lite, so (now that Panfrost knows how to handle this properly) we should describe it as such. Otherwise the mismatch leads to all manner of fun with mismatched attributes and inadvertently snooping stale data from caches, which would account for at least some of the brokenness observed on this platform.

Signed-off-by: Robin Murphy robin.murphy@arm.com

arch/arm64/boot/dts/amlogic/meson-g12b.dtsi | 4 ++++ 1 file changed, 4 insertions(+)

diff --git a/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi b/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi index 9b8548e5f6e5..ee8fcae9f9f0 100644 --- a/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi +++ b/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi @@ -135,3 +135,7 @@ map1 { }; }; };

+&mali {

dma-coherent;

+};

Thanks a lot for digging, I'll run a test to confirm it fixes the issue !

Sorry for the late reply. I triggered a dEQP run with this patch applied and I see a bunch of "panfrost ffe40000.gpu: matching BO is not heap type" errors (see below for a full backtrace). That doesn't seem to happen when we drop this dma-coherent property.

[ 690.945731] ------------[ cut here ]------------ [ 690.950003] panfrost ffe40000.gpu: matching BO is not heap type (GPU VA = 319a000) [ 690.950051] WARNING: CPU: 0 PID: 120 at drivers/gpu/drm/panfrost/panfrost_mmu.c:465 panfrost_mmu_irq_handler_thread+0x47c/0x650 [ 690.968854] Modules linked in: [ 690.971878] CPU: 0 PID: 120 Comm: irq/27-panfrost Tainted: G W 5.9.0-rc5-02434-g7d8109ec5a42 #784 [ 690.981964] Hardware name: Khadas VIM3 (DT) [ 690.986107] pstate: 60000005 (nZCv daif -PAN -UAO BTYPE=--) [ 690.991627] pc : panfrost_mmu_irq_handler_thread+0x47c/0x650 [ 690.997232] lr : panfrost_mmu_irq_handler_thread+0x47c/0x650 [ 691.002836] sp : ffff800011bcbcd0 [ 691.006114] x29: ffff800011bcbcf0 x28: ffff0000f3fe3800 [ 691.011375] x27: ffff0000ceaf5350 x26: ffff0000ca5fc500 [ 691.016636] x25: ffff0000f32409c0 x24: 0000000000000001 [ 691.021897] x23: ffff0000f3240880 x22: ffff0000f3e3a800 [ 691.027159] x21: 0000000000000000 x20: 0000000000000000 [ 691.032420] x19: 0000000000010001 x18: 0000000000000020 [ 691.037681] x17: 0000000000000000 x16: 0000000000000000 [ 691.042942] x15: ffff0000f3fe3c70 x14: ffffffffffffffff [ 691.048204] x13: ffff8000116c2428 x12: ffff8000116c2086 [ 691.053466] x11: ffff800011bcbcd0 x10: ffff800011bcbcd0 [ 691.058727] x9 : 00000000fffffffe x8 : 0000000000000000 [ 691.063988] x7 : 7420706165682074 x6 : ffff8000116c1816 [ 691.069249] x5 : 0000000000000000 x4 : 0000000000000000 [ 691.074510] x3 : 00000000ffffffff x2 : ffff8000e348c000 [ 691.079771] x1 : f1b91ff9af2df000 x0 : 0000000000000000 [ 691.085033] Call trace: [ 691.087452] panfrost_mmu_irq_handler_thread+0x47c/0x650 [ 691.092712] irq_thread_fn+0x2c/0xa0 [ 691.096246] irq_thread+0x184/0x208 [ 691.099699] kthread+0x140/0x160 [ 691.102890] ret_from_fork+0x10/0x34 [ 691.106424] ---[ end trace b5dd8c2dfada8236 ]---

It's quite possible this is caused by the GPU seeing a stale page table entry, so perhaps coherency isn't working as well as it should...

Do you get an "Unhandled Page fault" message after this?

Yep (see below).

--->8---

[...]

...

[ 689.805864] panfrost ffe40000.gpu: Unhandled Page fault in AS0 at VA 0x0000000003146080 [ 689.805864] Reason: TODO [ 689.805864] raw fault status: 0x10003C3 [ 689.805864] decoded fault status: SLAVE FAULT [ 689.805864] exception type 0xC3: TRANSLATION_FAULT_LEVEL3 [ 689.805864] access type 0x3: WRITE [ 689.805864] source id 0x100 [ 690.170419] panfrost ffe40000.gpu: gpu sched timeout, js=1, config=0x7300, status=0x8, head=0x3101100, tail=0x3101100, sched_job=000000004b442768 [ 690.770373] panfrost ffe40000.gpu: error powering up gpu shader [ 690.945123] panfrost ffe40000.gpu: error powering up gpu shader [ 690.945731] ------------[ cut here ]------------

That's a write fault from level 3 of the page table triggered by shader core 0 in a fragment job. So could be writing out the frame buffer.

It would be interesting to see if a patch like below would work round it:

----8<---- diff --git a/drivers/gpu/drm/panfrost/panfrost_mmu.c b/drivers/gpu/drm/panfrost/panfrost_mmu.c index e8f7b11352d2..5144860afdea 100644 --- a/drivers/gpu/drm/panfrost/panfrost_mmu.c +++ b/drivers/gpu/drm/panfrost/panfrost_mmu.c @@ -460,9 +460,12 @@ static int panfrost_mmu_map_fault_addr(struct panfrost_device *pfdev, int as,

bo = bomapping->obj; if (!bo->is_heap) { - dev_WARN(pfdev->dev, "matching BO is not heap type (GPU VA = %llx)", + dev_err(pfdev->dev, "matching BO is not heap type (GPU VA = %llx)", bomapping->mmnode.start << PAGE_SHIFT); - ret = -EINVAL; + /* Force a flush of the MMU to restart the transaction */ + mmu_hw_do_operation(pfdev, bomapping->mmu, addr, 0, + AS_COMMAND_FLUSH_MEM); + ret = 0; goto err_bo; } WARN_ON(bomapping->mmu->as != as); ---8<---

That's obviously not the correct solution (the fault shouldn't occur), but if flushing the GPU's caches and retrying works then we know it's a coherency issue. You can also try backing off from the big hammer of AS_COMMAND_FLUSH_MEM and trying AS_COMMAND_FLUSH_PT or AS_COMMAND_UNLOCK.

Steve

Neil Armstrong

16 Sep 16 Sep

2:54 p.m.

New subject: [PATCH 3/3] arm64: dts: meson: Describe G12b GPU as coherent

On 16/09/2020 01:51, Robin Murphy wrote:

...

According to a downstream commit I found in the Khadas vendor kernel, the GPU on G12b is wired up for ACE-lite, so (now that Panfrost knows how to handle this properly) we should describe it as such. Otherwise the mismatch leads to all manner of fun with mismatched attributes and inadvertently snooping stale data from caches, which would account for at least some of the brokenness observed on this platform.

Signed-off-by: Robin Murphy robin.murphy@arm.com

arch/arm64/boot/dts/amlogic/meson-g12b.dtsi | 4 ++++ 1 file changed, 4 insertions(+)

diff --git a/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi b/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi index 9b8548e5f6e5..ee8fcae9f9f0 100644 --- a/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi +++ b/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi @@ -135,3 +135,7 @@ map1 { }; }; };

+&mali {

dma-coherent;

+};

Reviewed-by: Neil Armstrong narmstrong@baylibre.com

Neil Armstrong

2:54 p.m.

Hi Robin,

On 16/09/2020 01:51, Robin Murphy wrote:

...

Hi all,

I polished up my original proof-of-concept a little while back, but now that I've got my hands on my Juno again I've been able to actually test it to my satisfaction, so here are proper patches!

I tested on the Kkadas VIM3, and yes it fixes the random FAULTS I have *without*: [ 152.417127] panfrost ffe40000.gpu: gpu sched timeout, js=0, config=0x7300, status=0x58, head=0x3091400, tail=0x3091400, sched_job=000000004d83c2d7 [ 152.530928] panfrost ffe40000.gpu: js fault, js=1, status=INSTR_INVALID_ENC, head=0x30913c0, tail=0x30913c0 [ 152.539797] panfrost ffe40000.gpu: gpu sched timeout, js=1, config=0x7300, status=0x51, head=0x30913c0, tail=0x30913c0, sched_job=0000000038cecaf6 [ 156.943505] panfrost ffe40000.gpu: js fault, js=0, status=TILE_RANGE_FAULT, head=0x3091400, tail=0x3091400

but, with this patchset, I get the following fps with kmscube: Rendered 97 frames in 2.016291 sec (48.108145 fps) Rendered 206 frames in 4.016723 sec (51.285584 fps) Rendered 316 frames in 6.017208 sec (52.516052 fps) Rendered 430 frames in 8.017456 sec (53.632975 fps)

but when I resurrect my BROKEN_NS patchset (simply disabling shareability), I get: Rendered 120 frames in 2.000143 sec (59.995724 fps) Rendered 241 frames in 4.016760 sec (59.998605 fps) Rendered 362 frames in 6.033443 sec (59.998911 fps) Rendered 482 frames in 8.033531 sec (59.998522 fps)

So I get a performance regression with the dma-coherent approach, even if it's clearly the cleaner.

So: Tested-by: Neil Armstrong narmstrong@baylibre.com

Neil

...

It probably makes sense for patches #1 and #2 to stay together and both go via drm-misc, provided Will's OK with that.

Robin.

Robin Murphy (3): iommu/io-pgtable-arm: Support coherency for Mali LPAE drm/panfrost: Support cache-coherent integrations arm64: dts: meson: Describe G12b GPU as coherent

arch/arm64/boot/dts/amlogic/meson-g12b.dtsi | 4 ++++ drivers/gpu/drm/panfrost/panfrost_device.h | 1 + drivers/gpu/drm/panfrost/panfrost_drv.c | 2 ++ drivers/gpu/drm/panfrost/panfrost_gem.c | 2 ++ drivers/gpu/drm/panfrost/panfrost_mmu.c | 1 + drivers/iommu/io-pgtable-arm.c | 5 ++++- 6 files changed, 14 insertions(+), 1 deletion(-)

Alyssa Rosenzweig

5:04 p.m.

...

So I get a performance regression with the dma-coherent approach, even if it's clearly the cleaner.

That's bizarre -- this should really be the faster of the two.

Rob Herring

5:46 p.m.

On Wed, Sep 16, 2020 at 11:04 AM Alyssa Rosenzweig alyssa.rosenzweig@collabora.com wrote:

...

...
So I get a performance regression with the dma-coherent approach, even if it's clearly the cleaner.

That's bizarre -- this should really be the faster of the two.

Coherency may not be free. CortexA9 had something like 4x slower memcpy if SMP was enabled as an example. I don't know if there's anything going on like that specifically here. If there's never any CPU accesses mixed in with kmscube, then there would be no benefit to coherency.

Rob

Steven Price

17 Sep 17 Sep

10:38 a.m.

On 16/09/2020 18:46, Rob Herring wrote:

...

On Wed, Sep 16, 2020 at 11:04 AM Alyssa Rosenzweig alyssa.rosenzweig@collabora.com wrote:

...
...
So I get a performance regression with the dma-coherent approach, even if it's clearly the cleaner.

That's bizarre -- this should really be the faster of the two.

Coherency may not be free. CortexA9 had something like 4x slower memcpy if SMP was enabled as an example. I don't know if there's anything going on like that specifically here. If there's never any CPU accesses mixed in with kmscube, then there would be no benefit to coherency.

The DDK blob has the ability to mark only certain areas of memory as coherent for performance reasons. For simple things like kmscube I would expect that it's basically write-only from the CPU and almost all memory the GPU touches isn't touched by the CPU. I.e. coherency isn't helping and the coherency traffic is probably expensive. Whether the complexity is worth it for "real" content I don't know - it may just be silly benchmarks that benefit.

Steve

Tomeu Vizoso

10:51 a.m.

On 9/17/20 12:38 PM, Steven Price wrote:

...

On 16/09/2020 18:46, Rob Herring wrote:

...
On Wed, Sep 16, 2020 at 11:04 AM Alyssa Rosenzweig alyssa.rosenzweig@collabora.com wrote:

...
...
So I get a performance regression with the dma-coherent approach, even if it's clearly the cleaner.

That's bizarre -- this should really be the faster of the two.

Coherency may not be free. CortexA9 had something like 4x slower memcpy if SMP was enabled as an example. I don't know if there's anything going on like that specifically here. If there's never any CPU accesses mixed in with kmscube, then there would be no benefit to coherency.

The DDK blob has the ability to mark only certain areas of memory as coherent for performance reasons. For simple things like kmscube I would expect that it's basically write-only from the CPU and almost all memory the GPU touches isn't touched by the CPU. I.e. coherency isn't helping and the coherency traffic is probably expensive. Whether the complexity is worth it for "real" content I don't know - it may just be silly benchmarks that benefit.

Or maybe it's only a problem for applications that do silly things? I don't think kmscube was ever optimized for performance.

Regards,

Tomeu

Steven Price

11 a.m.

On 17/09/2020 11:51, Tomeu Vizoso wrote:

...

On 9/17/20 12:38 PM, Steven Price wrote:

...
On 16/09/2020 18:46, Rob Herring wrote:

...
On Wed, Sep 16, 2020 at 11:04 AM Alyssa Rosenzweig alyssa.rosenzweig@collabora.com wrote:

...
...
So I get a performance regression with the dma-coherent approach, even if it's clearly the cleaner.

That's bizarre -- this should really be the faster of the two.

Coherency may not be free. CortexA9 had something like 4x slower memcpy if SMP was enabled as an example. I don't know if there's anything going on like that specifically here. If there's never any CPU accesses mixed in with kmscube, then there would be no benefit to coherency.

The DDK blob has the ability to mark only certain areas of memory as coherent for performance reasons. For simple things like kmscube I would expect that it's basically write-only from the CPU and almost all memory the GPU touches isn't touched by the CPU. I.e. coherency isn't helping and the coherency traffic is probably expensive. Whether the complexity is worth it for "real" content I don't know - it may just be silly benchmarks that benefit.

Or maybe it's only a problem for applications that do silly things? I don't think kmscube was ever optimized for performance.

Well doing silly things is almost the definition of a benchmark ;) A lot of the mobile graphics benchmarks suffer from not being very intelligent in how they render (e.g. many have meshes that are far too detailed so the triangles are smaller than the pixels).

Of course there are also applications that get things wrong too.

Steve

Alyssa Rosenzweig

12:03 p.m.

...

The DDK blob has the ability to mark only certain areas of memory as coherent for performance reasons. For simple things like kmscube I would expect that it's basically write-only from the CPU and almost all memory the GPU touches isn't touched by the CPU. I.e. coherency isn't helping and the coherency traffic is probably expensive. Whether the complexity is worth it for "real" content I don't know - it may just be silly benchmarks that benefit.

Right, Panfrost userspace specifically assumes GPU reads to be expensive and treats GPU memory as write-only *except* for a few special cases (compute-like workloads, glReadPixels, some blits, etc).

The vast majority of the GPU memory - everything used in kmscube - will be write-only to the CPU and fed directly into the display zero-copy (or read by the GPU later as a dmabuf).

Robin Murphy

12:38 p.m.

On 2020-09-16 18:46, Rob Herring wrote:

...

On Wed, Sep 16, 2020 at 11:04 AM Alyssa Rosenzweig alyssa.rosenzweig@collabora.com wrote:

...
...
So I get a performance regression with the dma-coherent approach, even if it's clearly the cleaner.

That's bizarre -- this should really be the faster of the two.

Coherency may not be free. CortexA9 had something like 4x slower memcpy if SMP was enabled as an example. I don't know if there's anything going on like that specifically here. If there's never any CPU accesses mixed in with kmscube, then there would be no benefit to coherency.

There will still be CPU benefits in terms of not having to spend time cache-cleaning every BO upon allocation, and less overhead on writing out descriptors in the first place (due to cacheable vs. non-cacheable).

I haven't tried the NSh hack on Juno, but I don't see any notable performance issue as-is - kmscube hits a solid 60FPS from the off (now that it works without spewing faults). Given that the hardware on Juno can be generously described as "less good", it would certainly be interesting to figure out what difference is at play here...

The usual argument against I/O coherency is that it adds latency to every access, but if you already have a coherent interconnect anyway then the sensible answer to that is implementing decent snoop filters, rather than making software more complicated.

Robin.

1667

Age (days ago)

1687

Last active (days ago)

dri-devel@lists.freedesktop.org

21 comments

8 participants

tags (0)

participants (8)

Alyssa Rosenzweig
Boris Brezillon
Neil Armstrong
Rob Herring
Robin Murphy
Steven Price
Tomeu Vizoso
Will Deacon