From: Alexey Sheplyakov asheplyakov@basealt.ru
Hello,
these patches address some problems which prevent panfrost from working with Mali T628 GPUs:
- incomplete/incorrect MMU memory attributes
- wrong job affinity on dual core group GPUs
With these patches panfrost is able to drive the Mali T628 (r1p0) GPU on some armv8 SoCs (in particular BE-M1000). r0 GPUs are still not supported [yet] (tested with Exynos 5422).
Alexey Sheplyakov (2):
  drm/panfrost: mmu: improved memory attributes
  drm/panfrost: adjusted job affinity for dual core group GPUs
 drivers/gpu/drm/panfrost/panfrost_device.h |  7 ++++
 drivers/gpu/drm/panfrost/panfrost_gpu.c    | 45 ++++++++++++++++++++++
 drivers/gpu/drm/panfrost/panfrost_job.c    | 14 ++++++-
 drivers/gpu/drm/panfrost/panfrost_mmu.c    |  3 --
 drivers/iommu/io-pgtable-arm.c             | 16 ++++++--
 5 files changed, 76 insertions(+), 9 deletions(-)
On 13/01/2022 10:01, Alexey Sheplyakov wrote:
Hi, Steven!
Thanks for such a detailed explanation of T628 peculiarities.
On Wed, Jan 12, 2022 at 05:03:15PM +0000, Steven Price wrote:
On 10/01/2022 17:42, Alyssa Rosenzweig wrote:
Whether it's worth the effort depends on whether anyone really cares about getting the full performance out of this particular GPU.
At this stage I think the main UABI change would be to add the opposite flag to kbase, (e.g. "PANFROST_JD_DOESNT_NEED_COHERENCY_ON_GPU"[1]) to opt-in to allowing the job to run across all cores.
The second change would be to allow compute jobs to be run on the second core group, so another flag: PANFROST_RUN_ON_SECOND_CORE_GROUP.
But clearly there's little point adding such flags until someone steps up to do the Mesa work.
I worry about the maintenance burden (both Mesa and kernel) of adding UABI only used by a piece of hardware none of us own, and only useful "sometimes" for that hardware. Doubly so for the second core group support; currently Mesa doesn't advertise any compute support on anything older than Mali T760 ... to the best of my knowledge, nobody has missed that support either...
I agree there's no point adding the UABI support unless someone is willing to step up and be a maintainer for that hardware. And I suspect no one cares enough about that hardware to do that.
To be clear I am in favour of merging the patches needed for GLES2 to work on all Malis, possibly at a performance cost on these dual-core systems. That's a far cry from the level of support the DDK gave these chips back in the day ... of course, the DDK doesn't support them at all anymore, so Panfrost wins there by default! ;)
Agreed - I'm happy to merge a kernel series similar to this. I think the remaining problems are:
- Addressing Robin's concerns about the first patch. That looks like
it's probably just wrong.
The first patch is wrong and I'll drop it.
- I think this patch is too complex for the basic support. There's some
parts like checking GROUPS_L2_COHERENT which also don't feature in kbase
That part has been adapted from kbase_gpuprops_construct_coherent_groups, see https://github.com/hardkernel/linux/blob/2f0f4268209ddacc2cdea158104b87cedac...
Gah! I was looking at the wrong version of kbase... ;)
so I don't believe are correct.
I'm not sure if it's correct or not, however
- it does not change anything for GPUs with coherent L2 caches
- it appears to correctly figure out core groups for several SoCs with T628 GPU (BE-M1000, Exynos 5422).
Yeah, sorry about that - you are right this matches the (earlier) kbase versions. I don't believe there ever was a GPU released without GROUPS_L2_COHERENT set (which means that shader cores are coherent within an L2, *NOT* that the L2 is coherent).
I think I was having an off day yesterday... ;) Ignore this - it's arguably cargo-culting from kbase, but there are advantages to that.
- I don't think this blocks the change. But if we're not using the
second core group we could actually power it down. Indeed simply not turning on the L2/shader cores should in theory work (jobs are not scheduled to cores which are turned off even if they are included in the affinity).
Powering off unused GPU cores would be nice, however
- userspace might power on those cores again (via sysfs or something), so I prefer to explicitly schedule jobs to core group 0.
We don't expose that level of control to user space. Currently panfrost either powers on all cores or none.
- on BE-M1000 the GPU seems to lock up a few seconds after powering off some (GPU) cores. In fact I had to disable GPU devfreq to prevent GPU lockups.
I'm not sure how you tested powering off some GPU cores - I'm surprised that that causes problems. Having to disable GPU devfreq points to an issue with the DVFS performance points in your DT (perhaps the voltages are wrong?) or potentially a bug in the driving of the clocks/regulators.
Therefore I consider powering off unused cores as a later optimization. (frankly speaking I'd rather put the effort into *making use* of those cores instead of figuring out why they fail to power down properly).
Making use is much more work - it depends if you(/anyone) cares about power consumption on these devices. Although whether the hardware actually implements any meaningful power gating for the second core group is another matter so perhaps it wouldn't make any difference anyway.
My thought was something along the lines of the below which just turns the cores off and otherwise doesn't require playing with affinities. If we actually want to use the second core group then much more thought will have to go into how the job slots are used.
---8<---
diff --git a/drivers/gpu/drm/panfrost/panfrost_gpu.c b/drivers/gpu/drm/panfrost/panfrost_gpu.c
index bbe628b306ee..2e9f9d1ee830 100644
--- a/drivers/gpu/drm/panfrost/panfrost_gpu.c
+++ b/drivers/gpu/drm/panfrost/panfrost_gpu.c
@@ -320,19 +320,34 @@ void panfrost_gpu_power_on(struct panfrost_device *pfdev)
 {
 	int ret;
 	u32 val;
+	u32 core_mask = U32_MAX;
 
 	panfrost_gpu_init_quirks(pfdev);
 
-	/* Just turn on everything for now */
-	gpu_write(pfdev, L2_PWRON_LO, pfdev->features.l2_present);
+	if (pfdev->features.l2_present != 1) {
+		/*
+		 * Only support one core group for now.
+		 * ~(l2_present - 1) unsets all bits in l2_present except the
+		 * bottom bit. (l2_present - 2) has all the bits in the first
+		 * core group set. AND them together to generate a mask of
+		 * cores in the first core group.
+		 */
+		core_mask = ~(pfdev->features.l2_present - 1) &
+			     (pfdev->features.l2_present - 2);
+	}
+
+	gpu_write(pfdev, L2_PWRON_LO, pfdev->features.l2_present & core_mask);
 	ret = readl_relaxed_poll_timeout(pfdev->iomem + L2_READY_LO,
-		val, val == pfdev->features.l2_present, 100, 20000);
+		val, val == (pfdev->features.l2_present & core_mask),
+		100, 20000);
 	if (ret)
 		dev_err(pfdev->dev, "error powering up gpu L2");
 
-	gpu_write(pfdev, SHADER_PWRON_LO, pfdev->features.shader_present);
+	gpu_write(pfdev, SHADER_PWRON_LO,
+		  pfdev->features.shader_present & core_mask);
 	ret = readl_relaxed_poll_timeout(pfdev->iomem + SHADER_READY_LO,
-		val, val == pfdev->features.shader_present, 100, 20000);
+		val, val == (pfdev->features.shader_present & core_mask),
+		100, 20000);
 	if (ret)
 		dev_err(pfdev->dev, "error powering up gpu shader");
---8<---
I don't have a dual-core group system to hand so this is mostly untested.
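As an illustration of the mask arithmetic in the comment above, here is a minimal standalone sketch (plain userspace C, not kernel code) using a hypothetical l2_present value of 0x11 for a dual core group GPU; real hardware may report different masks.

/* Illustrative check of the core-mask derivation, assuming l2_present = 0x11
 * (two L2 caches, shader cores 0-3 behind the first one).
 */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t l2_present = 0x11;	/* hypothetical: L2 bits 0 and 4 */
	uint64_t core_mask;

	/* l2_present - 1 == 0x10, so ~(l2_present - 1) clears the second L2 bit;
	 * l2_present - 2 == 0x0f has every core bit of the first group set. */
	core_mask = ~(l2_present - 1) & (l2_present - 2);

	printf("core_mask = 0x%llx\n", (unsigned long long)core_mask); /* prints 0xf */
	return 0;
}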
Thanks,
Steve
From: Alexey Sheplyakov asheplyakov@basealt.ru
T62x/T60x GPUs are known to not work with panfrost as of now. One of the reasons is wrong/incomplete memory attributes which the panfrost driver sets in the page tables:
- MEMATTR_IMP_DEF should be 0x48ULL, not 0x88ULL. 0x88ULL is MEMATTR_OUTER_IMP_DEF
- MEMATTR_FORCE_CACHE_ALL and MEMATTR_OUTER_WA are missing.
T72x and newer GPUs work just fine with such incomplete/wrong memory attributes. However T62x are quite picky and quickly lock up.
To avoid the problem set the same memory attributes (in LPAE page tables) as mali_kbase.
The patch has been tested (for regressions) with T860 GPU (rk3399 SoC). At first glance (using GNOME desktop, running glmark) it does not cause any changes for this GPU.
Note: this patch is necessary, but *not* enough to get panfrost working with T62x
Signed-off-by: Alexey Sheplyakov asheplyakov@basealt.ru
Signed-off-by: Vadim V. Vlasov vadim.vlasov@elpitech.ru
Cc: Rob Herring robh@kernel.org
Cc: Tomeu Vizoso tomeu.vizoso@collabora.com
Cc: Steven Price steven.price@arm.com
Cc: Alyssa Rosenzweig alyssa.rosenzweig@collabora.com
Cc: Vadim V. Vlasov vadim.vlasov@elpitech.ru
---
 drivers/gpu/drm/panfrost/panfrost_mmu.c |  3 ---
 drivers/iommu/io-pgtable-arm.c          | 16 ++++++++++++----
 2 files changed, 12 insertions(+), 7 deletions(-)
diff --git a/drivers/gpu/drm/panfrost/panfrost_mmu.c b/drivers/gpu/drm/panfrost/panfrost_mmu.c
index 39562f2d11a4..2f4f8a17bc82 100644
--- a/drivers/gpu/drm/panfrost/panfrost_mmu.c
+++ b/drivers/gpu/drm/panfrost/panfrost_mmu.c
@@ -133,9 +133,6 @@ static void panfrost_mmu_enable(struct panfrost_device *pfdev, struct panfrost_m
 	mmu_write(pfdev, AS_TRANSTAB_LO(as_nr), lower_32_bits(transtab));
 	mmu_write(pfdev, AS_TRANSTAB_HI(as_nr), upper_32_bits(transtab));
 
-	/* Need to revisit mem attrs.
-	 * NC is the default, Mali driver is inner WT.
-	 */
 	mmu_write(pfdev, AS_MEMATTR_LO(as_nr), lower_32_bits(memattr));
 	mmu_write(pfdev, AS_MEMATTR_HI(as_nr), upper_32_bits(memattr));
 
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index dd9e47189d0d..15b39c337e20 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -122,13 +122,17 @@
 #define ARM_LPAE_MAIR_ATTR_IDX_CACHE	1
 #define ARM_LPAE_MAIR_ATTR_IDX_DEV	2
 #define ARM_LPAE_MAIR_ATTR_IDX_INC_OCACHE	3
+#define ARM_LPAE_MAIR_ATTR_IDX_OUTER_WA	4
 
 #define ARM_MALI_LPAE_TTBR_ADRMODE_TABLE (3u << 0)
 #define ARM_MALI_LPAE_TTBR_READ_INNER	BIT(2)
 #define ARM_MALI_LPAE_TTBR_SHARE_OUTER	BIT(4)
 
-#define ARM_MALI_LPAE_MEMATTR_IMP_DEF	0x88ULL
-#define ARM_MALI_LPAE_MEMATTR_WRITE_ALLOC 0x8DULL
+#define ARM_MALI_LPAE_MEMATTR_IMP_DEF	0x48ULL
+#define ARM_MALI_LPAE_MEMATTR_FORCE_CACHE_ALL	0x4FULL
+#define ARM_MALI_LPAE_MEMATTR_WRITE_ALLOC	0x4DULL
+#define ARM_MALI_LPAE_MEMATTR_OUTER_IMP_DEF	0x88ULL
+#define ARM_MALI_LPAE_MEMATTR_OUTER_WA	0x8DULL
 
 #define APPLE_DART_PTE_PROT_NO_WRITE (1<<7)
 #define APPLE_DART_PTE_PROT_NO_READ (1<<8)
@@ -1080,10 +1084,14 @@ arm_mali_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg, void *cookie)
 	cfg->arm_mali_lpae_cfg.memattr = (ARM_MALI_LPAE_MEMATTR_IMP_DEF
 					  << ARM_LPAE_MAIR_ATTR_SHIFT(ARM_LPAE_MAIR_ATTR_IDX_NC)) |
+					 (ARM_MALI_LPAE_MEMATTR_FORCE_CACHE_ALL
+					  << ARM_LPAE_MAIR_ATTR_SHIFT(ARM_LPAE_MAIR_ATTR_IDX_CACHE)) |
 					 (ARM_MALI_LPAE_MEMATTR_WRITE_ALLOC
 					  << ARM_LPAE_MAIR_ATTR_SHIFT(ARM_LPAE_MAIR_ATTR_IDX_CACHE)) |
-					 (ARM_MALI_LPAE_MEMATTR_IMP_DEF
-					  << ARM_LPAE_MAIR_ATTR_SHIFT(ARM_LPAE_MAIR_ATTR_IDX_DEV));
+					 (ARM_MALI_LPAE_MEMATTR_OUTER_IMP_DEF
+					  << ARM_LPAE_MAIR_ATTR_SHIFT(ARM_LPAE_MAIR_ATTR_IDX_DEV)) |
+					 (ARM_MALI_LPAE_MEMATTR_OUTER_WA
+					  << ARM_LPAE_MAIR_ATTR_SHIFT(ARM_LPAE_MAIR_ATTR_IDX_OUTER_WA));
 
 	data->pgd = __arm_lpae_alloc_pages(ARM_LPAE_PGD_SIZE(data), GFP_KERNEL, cfg);
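For reference, ARM_LPAE_MAIR_ATTR_SHIFT(n) expands to ((n) << 3) in mainline io-pgtable-arm.c, so the AS_MEMATTR value packs one attribute byte per index. Below is a standalone sketch (not kernel code; byte values copied from the defines above) of the register value before and after this patch; note that, as written, FORCE_CACHE_ALL and WRITE_ALLOC both end up ORed into the index 1 byte.

/* Illustrative packing of the 64-bit AS_MEMATTR value, assuming one byte per
 * attribute index (ARM_LPAE_MAIR_ATTR_SHIFT(n) == (n) << 3).
 */
#include <stdio.h>
#include <stdint.h>

#define SHIFT(idx)	((uint64_t)(idx) << 3)

int main(void)
{
	/* mainline: IMP_DEF (0x88) at index 0 (NC), WRITE_ALLOC (0x8D) at
	 * index 1 (CACHE), IMP_DEF again at index 2 (DEV) */
	uint64_t mainline = 0x88ULL << SHIFT(0) |
			    0x8DULL << SHIFT(1) |
			    0x88ULL << SHIFT(2);

	/* this patch: 0x48 at NC, 0x4F and 0x4D ORed into CACHE,
	 * 0x88 at DEV, 0x8D at the new OUTER_WA index 4 */
	uint64_t patched = 0x48ULL << SHIFT(0) |
			   (0x4FULL | 0x4DULL) << SHIFT(1) |
			   0x88ULL << SHIFT(2) |
			   0x8DULL << SHIFT(4);

	printf("mainline AS_MEMATTR = 0x%016llx\n", (unsigned long long)mainline);
	printf("patched  AS_MEMATTR = 0x%016llx\n", (unsigned long long)patched);
	return 0;
}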
On 2021-12-23 11:06, asheplyakov@basealt.ru wrote:
From: Alexey Sheplyakov asheplyakov@basealt.ru
T62x/T60x GPUs are known to not work with panfrost as of now. One of the reasons is wrong/incomplete memory attributes which the panfrost driver sets in the page tables:
- MEMATTR_IMP_DEF should be 0x48ULL, not 0x88ULL. 0x88ULL is MEMATTR_OUTER_IMP_DEF
I guess the macro could be renamed if anyone's particularly bothered, but using the outer-cacheable attribute is deliberate because it is necessary for I/O-coherent GPUs to work properly (and should be irrelevant for non-coherent integrations). I'm away from my Juno board for the next couple of weeks but I'm fairly confident that this change as-is will break cache snooping.
- MEMATTR_FORCE_CACHE_ALL and MEMATTR_OUTER_WA are missing.
They're "missing" because they're not used, and there's currently no mechanism by which they *could* be used. Also note that the indices in MEMATTR have to line up with the euqivalent LPAE indices for the closest match to what the IOMMU API's IOMMU_{CACHE,MMIO} flags represent, so moving those around is yet more subtle breakage.
T72x and newer GPUs work just fine with such incomplete/wrong memory attributes. However T62x are quite picky and quickly lock up.
To avoid the problem set the same memory attributes (in LPAE page tables) as mali_kbase.
The patch has been tested (for regressions) with T860 GPU (rk3399 SoC). At first glance (using GNOME desktop, running glmark) it does not cause any changes for this GPU.
Note: this patch is necessary, but *not* enough to get panfrost working with T62x
I'd note that panfrost has been working OK - to the extent that Mesa supports its older ISA - on the T624 (single core group) in Arm's Juno SoC for over a year now since commit 268af50f38b1.
If you have to force outer non-cacheable to avoid getting translation faults and other errors that look like the GPU is inexplicably seeing the wrong data, I'd check whether you have the same thing where your integration is actually I/O-coherent and you're missing the "dma-coherent" property in your DT.
Thanks, Robin.
Hi, Robin,
On Fri, Dec 24, 2021 at 12:56:57PM +0000, Robin Murphy wrote:
I'd note that panfrost has been working OK - to the extent that Mesa supports its older ISA - on the T624 (single core group) in Arm's Juno SoC for over a year now since commit 268af50f38b1.
If you have to force outer non-cacheable to avoid getting translation faults and other errors that look like the GPU is inexplicably seeing the wrong data, I'd check whether you have the same thing where your integration is actually I/O-coherent and you're missing the "dma-coherent" property in your DT.
Thanks for a detailed explanation. Indeed adding the "dma-coherent" property (and the 2nd patch which effectively disables the 2nd core group) is enough to get panfrost working with T628 on BE-M1000 SoC. Unfortunately the same trick does NOT work for Exynos 5422. Here I get an Oops as soon as the driver tries to reset the GPU [1]. My patch does not work for Exynos 5422 either (but in a different way: GPU locks up in a few seconds).
So the first patch seems to be wrong and I'll drop it.
Best regards, Alexey
[1]
[ 73.061941] panfrost 11800000.gpu: AS_ACTIVE bit stuck
[ 73.065671] 8<--- cut here ---
[ 73.067015] Power domain G3D disable failed
[ 73.068637] Unhandled fault: external abort on non-linefetch (0x1008) at 0xf09d0020
[ 73.080424] pgd = f347788b
[ 73.083108] [f09d0020] *pgd=414eb811, *pte=11800653, *ppte=11800453
[ 73.089352] Internal error: : 1008 [#1] PREEMPT SMP ARM
[ 73.094549] Modules linked in: xfrm_user l2tp_ppp l2tp_netlink l2tp_core ip6_udp_tunnel udp_tunnel pppox ppp_generic slhc rfkill loop zstd input_leds panfrost lzo_rle gpu_sched evdev uio_pdrv_genirq uio exynos_gpiomem zram sch_fq_codel ip_tables ipv6 btrfs blake2b_generic xor xor_neon zstd_compress lzo_compress zlib_deflate raid6_pq libcrc32c usbhid gpio_keys
[ 73.126264] CPU: 3 PID: 130 Comm: kworker/u16:3 Not tainted 5.15.8-00057-g5ecb31848317 #1
[ 73.134407] Hardware name: Hardkernel ODROID-XU4
[ 73.139001] Workqueue: panfrost-reset panfrost_reset_work [panfrost]
[ 73.145325] PC is at panfrost_gpu_soft_reset+0xa0/0x104 [panfrost]
[ 73.151477] LR is at schedule_hrtimeout_range_clock+0x114/0x240
[ 73.157370] pc : [<bf2c092c>] lr : [<c0bca590>] psr: 600e0013
[ 73.163609] sp : c16f3e88 ip : 00004710 fp : c16f3ea4
[ 73.168809] r10: c1255220 r9 : c5982a88 r8 : c5982a88
[ 73.174007] r7 : c5982a7c r6 : 00000011 r5 : 0364a751 r4 : c5982840
[ 73.180506] r3 : f09d0000 r2 : 00000000 r1 : 00000000 r0 : 00000000
[ 73.187006] Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none
[ 73.194112] Control: 10c5387d Table: 45d5c06a DAC: 00000051
[ 73.199830] Register r0 information: NULL pointer
[ 73.204509] Register r1 information: NULL pointer
[ 73.209189] Register r2 information: NULL pointer
[ 73.213868] Register r3 information: 0-page vmalloc region starting at 0xf09d0000 allocated at __devm_ioremap_resource+0x170/0x1e8
[ 73.225566] Register r4 information: slab kmalloc-1k start c5982800 pointer offset 64 size 1024
[ 73.234232] Register r5 information: non-paged memory
[ 73.239258] Register r6 information: non-paged memory
[ 73.244284] Register r7 information: slab kmalloc-1k start c5982800 pointer offset 636 size 1024
[ 73.253036] Register r8 information: slab kmalloc-1k start c5982800 pointer offset 648 size 1024
[ 73.261788] Register r9 information: slab kmalloc-1k start c5982800 pointer offset 648 size 1024
[ 73.270540] Register r10 information: non-slab/vmalloc memory
[ 73.276259] Register r11 information: non-slab/vmalloc memory
[ 73.281978] Register r12 information: non-paged memory
[ 73.287091] Process kworker/u16:3 (pid: 130, stack limit = 0x6da5c776)
[ 73.293591] Stack: (0xc16f3e88 to 0xc16f4000)
[ 73.297926] 3e80: c5982840 00000000 00000000 c5982a7c c16f3ebc c16f3ea8
[ 73.306072] 3ea0: bf2bf65c bf2c0898 c5982840 00000000 c16f3ef4 c16f3ec0 bf2c278c bf2bf64c
[ 73.314218] 3ec0: c16f3ef4 c16f3ed0 c015cf28 c5982aa4 c44ba900 c1410a00 c1e9c400 00000000
[ 73.322363] 3ee0: c1e9c405 c1255220 c16f3f04 c16f3ef8 bf2c29cc bf2c2538 c16f3f44 c16f3f08
[ 73.330509] 3f00: c014b3ac bf2c29b8 c1410a00 c1410a00 c1410a00 c1410a00 c1410a00 c44ba900
[ 73.338654] 3f20: c1410a00 c44ba918 c1410a18 c1103d00 00000088 c1410a00 c16f3f7c c16f3f48
[ 73.346800] 3f40: c014bc18 c014b1c0 c127b138 c16f2000 c0151004 c44bb880 c3747500 c014bbb4
[ 73.354945] 3f60: c44ba900 c16f2000 c35e5e8c c3747520 c16f3fac c16f3f80 c0152ee8 c014bbc0
[ 73.363091] 3f80: 00000000 c44bb880 c0152d7c 00000000 00000000 00000000 00000000 00000000
[ 73.371236] 3fa0: 00000000 c16f3fb0 c01000fc c0152d88 00000000 00000000 00000000 00000000
[ 73.379381] 3fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[ 73.387526] 3fe0: 00000000 00000000 00000000 00000000 00000013 00000000 00000000 00000000
[ 73.395669] Backtrace:
[ 73.398096] [<bf2c088c>] (panfrost_gpu_soft_reset [panfrost]) from [<bf2bf65c>] (panfrost_device_reset+0x1c/0x38 [panfrost])
[ 73.409278] r7:c5982a7c r6:00000000 r5:00000000 r4:c5982840
[ 73.414906] [<bf2bf640>] (panfrost_device_reset [panfrost]) from [<bf2c278c>] (panfrost_reset+0x260/0x378 [panfrost])
[ 73.425479] r5:00000000 r4:c5982840
[ 73.429031] [<bf2c252c>] (panfrost_reset [panfrost]) from [<bf2c29cc>] (panfrost_reset_work+0x20/0x24 [panfrost])
[ 73.439260] r10:c1255220 r9:c1e9c405 r8:00000000 r7:c1e9c400 r6:c1410a00 r5:c44ba900
[ 73.447055] r4:c5982aa4
[ 73.449568] [<bf2c29ac>] (panfrost_reset_work [panfrost]) from [<c014b3ac>] (process_one_work+0x1f8/0x5c0)
[ 73.459186] [<c014b1b4>] (process_one_work) from [<c014bc18>] (worker_thread+0x64/0x580)
[ 73.467250] r10:c1410a00 r9:00000088 r8:c1103d00 r7:c1410a18 r6:c44ba918 r5:c1410a00
[ 73.475045] r4:c44ba900
[ 73.477557] [<c014bbb4>] (worker_thread) from [<c0152ee8>] (kthread+0x16c/0x1a0)
[ 73.484927] r10:c3747520 r9:c35e5e8c r8:c16f2000 r7:c44ba900 r6:c014bbb4 r5:c3747500
[ 73.492722] r4:c44bb880
[ 73.495234] [<c0152d7c>] (kthread) from [<c01000fc>] (ret_from_fork+0x14/0x38)
[ 73.502427] Exception stack(0xc16f3fb0 to 0xc16f3ff8)
[ 73.507454] 3fa0: 00000000 00000000 00000000 00000000
[ 73.515601] 3fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[ 73.523746] 3fe0: 00000000 00000000 00000000 00000000 00000013 00000000
[ 73.530334] r10:00000000 r9:00000000 r8:00000000 r7:00000000 r6:00000000 r5:c0152d7c
[ 73.538129] r4:c44bb880 r3:00000000
[ 73.541685] Code: e3a0001a ba000010 eb64253c e594300c (e5933020)
[ 73.547753] ---[ end trace 1af2dce52aebcc96 ]---
However my patch does not work for Exynos 5422 either, so I'll drop the first patch.
Thanks, Robin.
From: Alexey Sheplyakov asheplyakov@basealt.ru
On a dual core group GPU it's not OK to run some jobs on any core. For such jobs the following affinity rule must be satisfied: job slot 2 must target core group 1, and slots 0 and 1 must target core group 0.
The kernel driver itself can't guess which jobs need such a strict affinity, so setting proper requirements is the responsibility of userspace (Mesa). However, userspace is not smart enough [yet]. Therefore this patch applies the above affinity rule to all jobs on dual core group GPUs.
With this patch panfrost is able to drive the T628 (r1p0) GPU on some armv8 SoCs (in particular BE-M1000).
The patch has also been tested with T860 (rk3399 SoC). At first glance (using GNOME desktop, running glmark) the patch does not cause any changes with T860 GPU.
Signed-off-by: Alexey Sheplyakov asheplyakov@basealt.ru
Signed-off-by: Vadim V. Vlasov vadim.vlasov@elpitech.ru
Cc: Rob Herring robh@kernel.org
Cc: Tomeu Vizoso tomeu.vizoso@collabora.com
Cc: Steven Price steven.price@arm.com
Cc: Alyssa Rosenzweig alyssa.rosenzweig@collabora.com
Cc: Vadim V. Vlasov vadim.vlasov@elpitech.ru
---
 drivers/gpu/drm/panfrost/panfrost_device.h |  7 ++++
 drivers/gpu/drm/panfrost/panfrost_gpu.c    | 45 ++++++++++++++++++++++
 drivers/gpu/drm/panfrost/panfrost_job.c    | 14 ++++++-
 3 files changed, 64 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h b/drivers/gpu/drm/panfrost/panfrost_device.h
index 8b25278f34c8..aa43b1413c8a 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.h
+++ b/drivers/gpu/drm/panfrost/panfrost_device.h
@@ -23,6 +23,12 @@ struct panfrost_perfcnt;
 
 #define NUM_JOB_SLOTS 3
 #define MAX_PM_DOMAINS 3
+#define MAX_COHERENT_GROUPS 2
+
+struct panfrost_coherent_group {
+	u64 core_mask;
+	unsigned int nr_cores;
+};
 
 struct panfrost_features {
 	u16 id;
@@ -54,6 +60,7 @@ struct panfrost_features {
 
 	unsigned long hw_features[64 / BITS_PER_LONG];
 	unsigned long hw_issues[64 / BITS_PER_LONG];
+	struct panfrost_coherent_group core_groups[MAX_COHERENT_GROUPS];
 };
 
 /*
diff --git a/drivers/gpu/drm/panfrost/panfrost_gpu.c b/drivers/gpu/drm/panfrost/panfrost_gpu.c
index bbe628b306ee..a02cb160cb4f 100644
--- a/drivers/gpu/drm/panfrost/panfrost_gpu.c
+++ b/drivers/gpu/drm/panfrost/panfrost_gpu.c
@@ -209,6 +209,41 @@ static const struct panfrost_model gpu_models[] = {
 		GPU_REV(g31, 1, 0)),
 };
 
+static void panfrost_decode_coherent_groups(struct panfrost_features *features)
+{
+	struct panfrost_coherent_group *current_group;
+	u64 group_present;
+	u64 group_mask;
+	u64 first_set, first_set_prev;
+	u32 nr_group = 0;
+
+	if (features->mem_features & GROUPS_L2_COHERENT)
+		group_present = features->l2_present;
+	else
+		group_present = features->shader_present;
+
+	current_group = features->core_groups;
+	first_set = group_present & ~(group_present - 1);
+
+	while (group_present != 0 && nr_group < MAX_COHERENT_GROUPS) {
+		group_present -= first_set;
+		first_set_prev = first_set;
+
+		first_set = group_present & ~(group_present - 1);
+		group_mask = (first_set - 1) & ~(first_set_prev - 1);
+		current_group->core_mask = group_mask & features->shader_present;
+		current_group->nr_cores = hweight64(current_group->core_mask);
+
+		nr_group++;
+		current_group++;
+	}
+
+	if (group_present) {
+		pr_warn("%s: too many coherent groups, expected <= %d",
+			__func__, MAX_COHERENT_GROUPS);
+	}
+}
+
 static void panfrost_gpu_init_features(struct panfrost_device *pfdev)
 {
 	u32 gpu_id, num_js, major, minor, status, rev;
@@ -217,6 +252,7 @@ static void panfrost_gpu_init_features(struct panfrost_device *pfdev)
 	u64 hw_issues = hw_issues_all;
 	const struct panfrost_model *model;
 	int i;
+	unsigned long core_mask[64/BITS_PER_LONG];
 
 	pfdev->features.l2_features = gpu_read(pfdev, GPU_L2_FEATURES);
 	pfdev->features.core_features = gpu_read(pfdev, GPU_CORE_FEATURES);
@@ -296,6 +332,7 @@ static void panfrost_gpu_init_features(struct panfrost_device *pfdev)
 
 	bitmap_from_u64(pfdev->features.hw_features, hw_feat);
 	bitmap_from_u64(pfdev->features.hw_issues, hw_issues);
+	panfrost_decode_coherent_groups(&pfdev->features);
 
 	dev_info(pfdev->dev, "mali-%s id 0x%x major 0x%x minor 0x%x status 0x%x",
 		 name, gpu_id, major, minor, status);
@@ -314,6 +351,14 @@ static void panfrost_gpu_init_features(struct panfrost_device *pfdev)
 
 	dev_info(pfdev->dev, "shader_present=0x%0llx l2_present=0x%0llx",
 		 pfdev->features.shader_present, pfdev->features.l2_present);
+	dev_info(pfdev->dev, "%u core groups\n", pfdev->features.nr_core_groups);
+	for (i = 0; i < (int)pfdev->features.nr_core_groups; i++) {
+		bitmap_from_u64(core_mask, pfdev->features.core_groups[i].core_mask);
+		dev_info(pfdev->dev, "core group %u: cores: %64pbl (total %u)\n",
+			 i,
+			 core_mask,
+			 pfdev->features.core_groups[i].nr_cores);
+	}
 }
 
 void panfrost_gpu_power_on(struct panfrost_device *pfdev)
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
index 908d79520853..454515c1e2f1 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -132,11 +132,21 @@ static void panfrost_job_write_affinity(struct panfrost_device *pfdev,
 
 	/*
 	 * Use all cores for now.
-	 * Eventually we may need to support tiler only jobs and h/w with
-	 * multiple (2) coherent core groups
+	 * Eventually we may need to support tiler only jobs
 	 */
 	affinity = pfdev->features.shader_present;
 
+	/* Userspace does not set the requirements properly yet.
+	 * Adjust affinity of all jobs on dual core group GPUs
+	 */
+	if (pfdev->features.nr_core_groups > 1) {
+		if (js == 2)
+			affinity &= pfdev->features.core_groups[1].core_mask;
+		else
+			affinity &= pfdev->features.core_groups[0].core_mask;
+		dev_dbg(pfdev->dev, "js: %d, affinity: %llxu\n", js, affinity);
+	}
+
 	job_write(pfdev, JS_AFFINITY_NEXT_LO(js), lower_32_bits(affinity));
 	job_write(pfdev, JS_AFFINITY_NEXT_HI(js), upper_32_bits(affinity));
 }
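To illustrate what panfrost_decode_coherent_groups() computes, here is a standalone walk-through (plain userspace C, not kernel code) of the same loop with hypothetical T628-like values, l2_present = 0x11 and shader_present = 0xff; real integrations may report different masks.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t group_present = 0x11;   /* hypothetical: one bit per L2/core group */
	uint64_t shader_present = 0xff;  /* hypothetical: eight shader cores */
	uint64_t first_set, first_set_prev, group_mask;
	int nr_group = 0;

	first_set = group_present & ~(group_present - 1);
	while (group_present != 0) {
		group_present -= first_set;
		first_set_prev = first_set;
		first_set = group_present & ~(group_present - 1);
		/* bits from the previous L2 up to (but not including) the next one */
		group_mask = (first_set - 1) & ~(first_set_prev - 1);
		printf("core group %d: core_mask 0x%llx\n", nr_group++,
		       (unsigned long long)(group_mask & shader_present));
	}
	/* prints core_mask 0xf for group 0 and 0xf0 for group 1 */
	return 0;
}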
The kernel driver itself can't guess which jobs need such a strict affinity, so setting proper requirements is the responsibility of userspace (Mesa). However, userspace is not smart enough [yet]. Therefore this patch applies the above affinity rule to all jobs on dual core group GPUs.
What does Mesa need to do for this to work "properly"? What are the limitations of the approach implemented here? If we need to extend it down the line with a UABI change, what would that look like?
Thanks,
Alyssa
Hi,
On 23.12.2021 18:11, Alyssa Rosenzweig wrote:
The kernel driver itself can't guess which jobs need such a strict affinity, so setting proper requirements is the responsibility of userspace (Mesa). However, userspace is not smart enough [yet]. Therefore this patch applies the above affinity rule to all jobs on dual core group GPUs.
What does Mesa need to do for this to work "properly"?
I don't know. The blob restricts affinity of jobs with JD_REQ_COHERENT_GROUP requirement. In theory jobs without such a requirement can run on any core, but in practice all jobs in slots 0, 1 are assigned to core group 0 (with workloads I've run - i.e. weston, firefox, glmark2, perhaps it's also SoC dependent). So I've forced all jobs in slots 0, 1 to core group 0. Surprisingly this (and memory attributes adjustment) appeared to be enough to get panfrost working with T628 (on some SoCs). Without these patches GPU locks up in a few seconds.
What are the limitations of the approach implemented here?
Suboptimal performance.
1) There might be job chains which don't care about affinity (I haven't seen any of these yet on systems I've got).
2) There might be dual core group GPUs which don't need such a strict affinity. (I haven't seen any dual core group T[78]xx GPUs yet. This doesn't mean such GPUs don't exist).
If we need to extend it down the line with a UABI change, what would that look like?
I have no idea. And I'm not sure if it's worth the effort (since most jobs end up on core group 0 anyway).
Best regards, Alexey
On 24/12/2021 08:56, Alexey Sheplyakov wrote:
Hi,
On 23.12.2021 18:11, Alyssa Rosenzweig wrote:
The kernel driver itself can't guess which jobs need such a strict affinity, so setting proper requirements is the responsibility of userspace (Mesa). However, userspace is not smart enough [yet]. Therefore this patch applies the above affinity rule to all jobs on dual core group GPUs.
What does Mesa need to do for this to work "properly"?
I don't know. The blob restricts affinity of jobs with JD_REQ_COHERENT_GROUP requirement. In theory jobs without such a requirement can run on any core, but in practice all jobs in slots 0, 1 are assigned to core group 0 (with workloads I've run - i.e. weston, firefox, glmark2, perhaps it's also SoC dependent). So I've forced all jobs in slots 0, 1 to core group 0. Surprisingly this (and memory attributes adjustment) appeared to be enough to get panfrost working with T628 (on some SoCs). Without these patches GPU locks up in a few seconds.
Let me fill in a few details here.
The T628 is pretty unique in that it has two core groups, i.e. more than one L2 cache. Previous designs (i.e. T604) didn't have enough cores to require a second core group, and later designs with sufficient cores have coherent L2 caches so act as a single core group (although the hardware has multiple L2s, it only reports a single one as they act as a single cache).
Note that technically the T608, T658 and T678 also exist and have this problem - but I don't believe any products were produced with these (so unless you're in ARM with a very unusual FPGA they can be ignored).
The blob/kbase handle this situation with a new flag JD_REQ_COHERENT_GROUP which specifies that the affinity of a job must land on a single (coherent) core group, and JD_REQ_SPECIFIC_COHERENT_GROUP which allows user space to target a specific group.
In theory fragment shading can be performed over all cores (because a fragment shader job doesn't need coherency between threads), so doesn't need the JD_REQ_COHERENT_GROUP flag; vertex shading, however, needs to run on the same core group as the tiler (which always lives in core group 0).
Of course there are various 'tricks' that can happen even within a fragment shader which might require coherency.
So the expected sequence is that vertex+tiling is restricted to core group 0, and fragment shading can be run over all cores. Although there can be issues with performance doing this naïvely because the Job Manager doesn't necessarily share the GPU's cores fairly between vertex and fragment jobs. Also note that a cache flush is needed between running the vertex+tiling and the fragment job to ensure that the extra core group is coherent - this can be expensive, so it may not be worth using the second core group in some situations. I'm not sure what logic the Blob uses for that.
Finally there's GPU compute (i.e. OpenCL): here coherency is usually required, but there's often more information about the amount of coherency needed. In this case it is possible to run different job chains on each core group. This is the only situation where slot 2 is used, and is the reason for the JS_REQ_SPECIFIC_COHERENT_GROUP flag. It's also a nightmare for scheduling as the hardware gets upset if the affinity masks for slot 1 and slot 2 overlap.
What are the limitations of the approach implemented here?
Suboptimal performance.
- There might be job chains which don't care about affinity (I haven't seen any of these yet on systems I've got).
You are effectively throwing away half the cores if everything is pinned to core group 0, so I'm pretty sure the Blob manages to run (some) fragment jobs without the COHERENT_GROUP flag.
But equally this is a reasonable first step for the kernel driver - we can make the GPU look like every other GPU by pretending the second core group doesn't exist.
- There might be dual core group GPUs which don't need such a strict affinity. (I haven't seen any dual core group T[78]xx GPUs yet. This doesn't mean such GPUs don't exist).
They should all be a single core group (fully coherent L2s).
If we need to extend it down the line with a UABI change, what would that look like?
I have no idea. And I'm not sure if it's worth the effort (since most jobs end up on core group 0 anyway).
Whether it's worth the effort depends on whether anyone really cares about getting the full performance out of this particular GPU.
At this stage I think the main UABI change would be to add the opposite flag to kbase, (e.g. "PANFROST_JD_DOESNT_NEED_COHERENCY_ON_GPU"[1]) to opt-in to allowing the job to run across all cores.
The second change would be to allow compute jobs to be run on the second core group, so another flag: PANFROST_RUN_ON_SECOND_CORE_GROUP.
But clearly there's little point adding such flags until someone steps up to do the Mesa work.
Steve
[1] Bike-shedding the name might be required ;)
Whether it's worth the effort depends on whether anyone really cares about getting the full performance out of this particular GPU.
At this stage I think the main UABI change would be to add the opposite flag to kbase, (e.g. "PANFROST_JD_DOESNT_NEED_COHERENCY_ON_GPU"[1]) to opt-in to allowing the job to run across all cores.
The second change would be to allow compute jobs to be run on the second core group, so another flag: PANFROST_RUN_ON_SECOND_CORE_GROUP.
But clearly there's little point adding such flags until someone steps up to do the Mesa work.
I worry about the maintenance burden (both Mesa and kernel) of adding UABI only used by a piece of hardware none of us own, and only useful "sometimes" for that hardware. Doubly so for the second core group support; currently Mesa doesn't advertise any compute support on anything older than Mali T760 ... to the best of my knowledge, nobody has missed that support either...
To be clear I am in favour of merging the patches needed for GLES2 to work on all Malis, possibly at a performance cost on these dual-core systems. That's a far cry from the level of support the DDK gave these chips back in the day ... of course, the DDK doesn't support them at all anymore, so Panfrost wins there by default! ;)
On 10/01/2022 17:42, Alyssa Rosenzweig wrote:
Whether it's worth the effort depends on whether anyone really cares about getting the full performance out of this particular GPU.
At this stage I think the main UABI change would be to add the opposite flag to kbase, (e.g. "PANFROST_JD_DOESNT_NEED_COHERENCY_ON_GPU"[1]) to opt-in to allowing the job to run across all cores.
The second change would be to allow compute jobs to be run on the second core group, so another flag: PANFROST_RUN_ON_SECOND_CORE_GROUP.
But clearly there's little point adding such flags until someone steps up to do the Mesa work.
I worry about the maintenance burden (both Mesa and kernel) of adding UABI only used by a piece of hardware none of us own, and only useful "sometimes" for that hardware. Doubly so for the second core group support; currently Mesa doesn't advertise any compute support on anything older than Mali T760 ... to the best of my knowledge, nobody has missed that support either...
I agree there's no point adding the UABI support unless someone is willing to step up and be a maintainer for that hardware. And I suspect no one cares enough about that hardware to do that.
To be clear I am in favour of merging the patches needed for GLES2 to work on all Malis, possibly at a performance cost on these dual-core systems. That's a far cry from the level of support the DDK gave these chips back in the day ... of course, the DDK doesn't support them at all anymore, so Panfrost wins there by default! ;)
Agreed - I'm happy to merge a kernel series similar to this. I think the remaining problems are:
1. Addressing Robin's concerns about the first patch. That looks like it's probably just wrong.
2. I think this patch is too complex for the basic support. There's some parts like checking GROUPS_L2_COHERENT which also don't feature in kbase so I don't believe are correct.
3. I don't think this blocks the change. But if we're not using the second core group we could actually power it down. Indeed simply not turning on the L2/shader cores should in theory work (jobs are not scheduled to cores which are turned off even if they are included in the affinity).
Thanks,
Steve
Hi, Steven!
Thanks for such a detailed explanation of T628 peculiarities.
On Wed, Jan 12, 2022 at 05:03:15PM +0000, Steven Price wrote:
On 10/01/2022 17:42, Alyssa Rosenzweig wrote:
Whether it's worth the effort depends on whether anyone really cares about getting the full performance out of this particular GPU.
At this stage I think the main UABI change would be to add the opposite flag to kbase, (e.g. "PANFROST_JD_DOESNT_NEED_COHERENCY_ON_GPU"[1]) to opt-in to allowing the job to run across all cores.
The second change would be to allow compute jobs to be run on the second core group, so another flag: PANFROST_RUN_ON_SECOND_CORE_GROUP.
But clearly there's little point adding such flags until someone steps up to do the Mesa work.
I worry about the maintenance burden (both Mesa and kernel) of adding UABI only used by a piece of hardware none of us own, and only useful "sometimes" for that hardware. Doubly so for the second core group support; currently Mesa doesn't advertise any compute support on anything older than Mali T760 ... to the best of my knowledge, nobody has missed that support either...
I agree there's no point adding the UABI support unless someone is willing to step up and be a maintainer for that hardware. And I suspect no one cares enough about that hardware to do that.
To be clear I am in favour of merging the patches needed for GLES2 to work on all Malis, possibly at a performance cost on these dual-core systems. That's a far cry from the level of support the DDK gave these chips back in the day ... of course, the DDK doesn't support them at all anymore, so Panfrost wins there by default! ;)
Agreed - I'm happy to merge a kernel series similar to this. I think the remaining problems are:
- Addressing Robin's concerns about the first patch. That looks like
it's probably just wrong.
The first patch is wrong and I'll drop it.
- I think this patch is too complex for the basic support. There's some
parts like checking GROUPS_L2_COHERENT which also don't feature in kbase
That part has been adapted from kbase_gpuprops_construct_coherent_groups, see https://github.com/hardkernel/linux/blob/2f0f4268209ddacc2cdea158104b87cedac...
so I don't believe are correct.
I'm not sure if it's correct or not, however
- it does not change anything for GPUs with coherent L2 caches
- it appears to correctly figure out core groups for several SoCs with T628 GPU (BE-M1000, Exynos 5422).
- I don't think this blocks the change. But if we're not using the
second core group we could actually power it down. Indeed simply not turning on the L2/shader cores should in theory work (jobs are not scheduled to cores which are turned off even if they are included in the affinity).
Powering off unused GPU cores would be nice, however
1) userspace might power on those cores again (via sysfs or something), so I prefer to explicitly schedule jobs to core group 0.
2) on BE-M1000 the GPU seems to lock up a few seconds after powering off some (GPU) cores. In fact I had to disable GPU devfreq to prevent GPU lockups.
Therefore I consider powering off unused cores as a later optimization. (frankly speaking I'd rather put the effort into *making use* of those cores instead of figuring out why they fail to power down properly).
Best regards, Alexey
Hi, Alyssa,
On Mon, Jan 10, 2022 at 12:42:44PM -0500, Alyssa Rosenzweig wrote:
Whether it's worth the effort depends on whether anyone really cares about getting the full performance out of this particular GPU.
At this stage I think the main UABI change would be to add the opposite flag to kbase, (e.g. "PANFROST_JD_DOESNT_NEED_COHERENCY_ON_GPU"[1]) to opt-in to allowing the job to run across all cores.
The second change would be to allow compute jobs to be run on the second core group, so another flag: PANFROST_RUN_ON_SECOND_CORE_GROUP.
But clearly there's little point adding such flags until someone steps up to do the Mesa work.
I worry about the maintenance burden (both Mesa and kernel) of adding UABI only used by a piece of hardware none of us own, and only useful
To solve the "no hardware" problem we can send you (or any interested panfrost hacker) a BE-M1000 based board (for free). BE-M1000 is 8 core armv8 (Cortex A53) SoC with Mali T628 GPU. Plese email me for the further details (preferably off the list) if you are interested.
Best regards, Alexey
On dual core group GPUs (such as the T628) fragment shading can be performed over all cores (because a fragment shader job doesn't need coherency between threads); vertex shading, however, needs to run on the same core group as the tiler (which always lives in core group 0).
As a first step to support the T628, power on only the first core group (so no jobs are scheduled on the second one). This makes the T628 look like every other Midgard GPU (and throws away up to half the cores).
With this patch panfrost is able to drive the T628 (r1p0) GPU on some armv8 SoCs (in particular BE-M1000). Without the patch rendering is horribly broken (the desktop is completely unusable) and eventually the GPU locks up (it takes from a few seconds to a couple of minutes).
Using the second core group requires support in Mesa (and a UABI change): userspace should
1) set PANFROST_JD_DOESNT_NEED_COHERENCY_ON_GPU flag to opt-in to allowing the job to run across all cores.
2) set PANFROST_RUN_ON_SECOND_CORE_GROUP flag to allow compute jobs to be run on the second core group (at the moment Mesa does not advertise compute support on anything older than Mali T760)
But there's little point adding such flags until someone (myself) steps up to do the Mesa work.
Signed-off-by: Alexey Sheplyakov asheplyakov@basealt.ru
Signed-off-by: Vadim V. Vlasov vadim.vlasov@elpitech.ru
Tested-by: Alexey Sheplyakov asheplyakov@basealt.ru
---
 drivers/gpu/drm/panfrost/panfrost_gpu.c | 27 ++++++++++++++++++++-----
 1 file changed, 22 insertions(+), 5 deletions(-)
diff --git a/drivers/gpu/drm/panfrost/panfrost_gpu.c b/drivers/gpu/drm/panfrost/panfrost_gpu.c
index f8355de6e335..15cec831a99a 100644
--- a/drivers/gpu/drm/panfrost/panfrost_gpu.c
+++ b/drivers/gpu/drm/panfrost/panfrost_gpu.c
@@ -320,19 +320,36 @@ void panfrost_gpu_power_on(struct panfrost_device *pfdev)
 {
 	int ret;
 	u32 val;
+	u64 core_mask = U64_MAX;
 
 	panfrost_gpu_init_quirks(pfdev);
 
-	/* Just turn on everything for now */
-	gpu_write(pfdev, L2_PWRON_LO, pfdev->features.l2_present);
+	if (pfdev->features.l2_present != 1) {
+		/*
+		 * Only support one core group now.
+		 * ~(l2_present - 1) unsets all bits in l2_present except
+		 * the bottom bit. (l2_present - 2) has all the bits in
+		 * the first core group set. AND them together to generate
+		 * a mask of cores in the first core group.
+		 */
+		core_mask = ~(pfdev->features.l2_present - 1) &
+			     (pfdev->features.l2_present - 2);
+		dev_info_once(pfdev->dev, "using only 1st core group (%lu cores from %lu)\n",
+			      hweight64(core_mask),
+			      hweight64(pfdev->features.shader_present));
+	}
+	gpu_write(pfdev, L2_PWRON_LO, pfdev->features.l2_present & core_mask);
 	ret = readl_relaxed_poll_timeout(pfdev->iomem + L2_READY_LO,
-		val, val == pfdev->features.l2_present, 100, 20000);
+		val, val == (pfdev->features.l2_present & core_mask),
+		100, 20000);
 	if (ret)
 		dev_err(pfdev->dev, "error powering up gpu L2");
 
-	gpu_write(pfdev, SHADER_PWRON_LO, pfdev->features.shader_present);
+	gpu_write(pfdev, SHADER_PWRON_LO,
+		  pfdev->features.shader_present & core_mask);
 	ret = readl_relaxed_poll_timeout(pfdev->iomem + SHADER_READY_LO,
-		val, val == pfdev->features.shader_present, 100, 20000);
+		val, val == (pfdev->features.shader_present & core_mask),
+		100, 20000);
 	if (ret)
 		dev_err(pfdev->dev, "error powering up gpu shader");
On 15/01/2022 16:06, Alexey Sheplyakov wrote:
On dual core group GPUs (such as the T628) fragment shading can be performed over all cores (because a fragment shader job doesn't need coherency between threads); vertex shading, however, needs to run on the same core group as the tiler (which always lives in core group 0).
As a first step to support the T628, power on only the first core group (so no jobs are scheduled on the second one). This makes the T628 look like every other Midgard GPU (and throws away up to half the cores).
With this patch panfrost is able to drive the T628 (r1p0) GPU on some armv8 SoCs (in particular BE-M1000). Without the patch rendering is horribly broken (the desktop is completely unusable) and eventually the GPU locks up (it takes from a few seconds to a couple of minutes).
Using the second core group requires support in Mesa (and a UABI change): userspace should
- set PANFROST_JD_DOESNT_NEED_COHERENCY_ON_GPU flag to opt-in to allowing the job to run across all cores.
- set PANFROST_RUN_ON_SECOND_CORE_GROUP flag to allow compute jobs to be run on the second core group (at the moment Mesa does not advertise compute support on anything older than Mali T760)
Minor comment: these flags obviously don't exist yet, and I would prefer some more thought went into the names before we merge a new UABI. I picked overly verbose names (for the purposes of discussion) in the hope that it was 'obvious' I hadn't thought about the names, but in hindsight realise this perhaps wasn't obvious. As well as being too verbose, the second name is missing the "_JD_" part of the prefix.
But there's little point adding such flags until someone (myself) steps up to do the Mesa work.
Signed-off-by: Alexey Sheplyakov asheplyakov@basealt.ru Signed-off-by: Vadim V. Vlasov vadim.vlasov@elpitech.ru Tested-by: Alexey Sheplyakov asheplyakov@basealt.ru
Since I basically wrote the patch you should have included a credit here for me (see the developer certificate of origin[1]). But thank you for writing up a detailed commit message.
I'll push to drm-misc-next with a Co-developed-by for me. Thanks for testing the patch and good luck with the Mesa changes!
Thanks,
Steve
[1] https://developercertificate.org/