It looks like a bug has crept into etnaviv between 4.16 and 4.17, which causes etnaviv to misbehave with the GC600 GPU on Dove. I don't think it's a GPU issue; I think it's a DRM issue.
I get multiple:
[ 596.711482] etnaviv-gpu f1840000.gpu: recover hung GPU!
[ 597.732852] etnaviv-gpu f1840000.gpu: GPU failed to reset: FE not idle, 3D not idle, 2D not idle
while Xorg is starting up. Ignore the "failed to reset", that just seems to be a property of the GC600, and of course is a subsequent issue after the primary problem.
Looking at the devcoredump:
00000004 = 000000fe Idle: FE- DE+ PE+ SH+ PA+ SE+ RA+ TX+ VG- IM- FP- TS-
So, all units on the GC600 were idle except for the front end.
00000660 = 00000812 Cmd: [wait DMA: idle Fetch: valid] Req idle Cal idle
00000664 = 102d06d8 Command DMA address
00000668 = 380000c8 FE fetched word 0
0000066c = 0000001f FE fetched word 1
The front end was basically idle at this point, at a WAIT 200 command. Digging through the ring:
00688: 08010e01 00000040 LDST 0x3804=0x00000040
00690: 40000002 102d06a0 LINK 0x102d06a0
00698: 40000002 102d0690 LINK 0x102d0690
006a0: 08010e04 0000001f LDST 0x3810=0x0000001f
006a8: 40000025 102d3000 LINK 0x102d3000
006b0: 08010e03 00000008 LDST 0x380c=0x00000008 Flush PE2D
006b8: 08010e02 00000701 LDST 0x3808=0x00000701 SEM FE -> PE
006c0: 48000000 00000701 STALL FE -> PE
006c8: 08010e01 00000041 LDST 0x3804=0x00000041
006d0: 380000c8(0000001f) WAIT 200
006d8: 40000002 102d06d0 LINK 0x102d06d0 <===========
We've basically come to the end of the currently issued command stream and hit the wait-link loop. Everything else in the devcoredump looks normal.
So, I think etnaviv DRM has missed an event signalled from the GPU.
This worked fine in 4.16, so it seems to be a regression.
Hi Russell,
On Tue, Jun 19, 2018 at 10:43 +0100, Russell King - ARM Linux wrote:
It looks like a bug has crept into etnaviv between 4.16 and 4.17, which causes etnaviv to misbehave with the GC600 GPU on Dove. I don't think it's a GPU issue; I think it's a DRM issue.
I get multiple:
[ 596.711482] etnaviv-gpu f1840000.gpu: recover hung GPU!
[ 597.732852] etnaviv-gpu f1840000.gpu: GPU failed to reset: FE not idle, 3D not idle, 2D not idle
while Xorg is starting up. Ignore the "failed to reset", that just seems to be a property of the GC600, and of course is a subsequent issue after the primary problem.
Looking at the devcoredump:
00000004 = 000000fe Idle: FE- DE+ PE+ SH+ PA+ SE+ RA+ TX+ VG- IM- FP- TS-
So, all units on the GC600 were idle except for the front end.
00000660 = 00000812 Cmd: [wait DMA: idle Fetch: valid] Req idle Cal idle
00000664 = 102d06d8 Command DMA address
00000668 = 380000c8 FE fetched word 0
0000066c = 0000001f FE fetched word 1
The front end was basically idle at this point, at a WAIT 200 command. Digging through the ring:
00688: 08010e01 00000040 LDST 0x3804=0x00000040
00690: 40000002 102d06a0 LINK 0x102d06a0
00698: 40000002 102d0690 LINK 0x102d0690
006a0: 08010e04 0000001f LDST 0x3810=0x0000001f
006a8: 40000025 102d3000 LINK 0x102d3000
006b0: 08010e03 00000008 LDST 0x380c=0x00000008 Flush PE2D
006b8: 08010e02 00000701 LDST 0x3808=0x00000701 SEM FE -> PE
006c0: 48000000 00000701 STALL FE -> PE
006c8: 08010e01 00000041 LDST 0x3804=0x00000041
006d0: 380000c8(0000001f) WAIT 200
006d8: 40000002 102d06d0 LINK 0x102d06d0 <===========
We've basically come to the end of the currently issued command stream and hit the wait-link loop. Everything else in the devcoredump looks normal.
So, I think etnaviv DRM has missed an event signalled from the GPU.
I don't see what would make us miss an event suddenly.
This worked fine in 4.16, so it seems to be a regression.
The only thing that comes to mind is that with the DRM scheduler we now enforce a job timeout of 500ms, without the previous logic that allowed a job to run indefinitely as long as it made progress, since that behaviour is a serious QoS issue.
This might bite you at this point, if Xorg manages to submit a really big job. The coredump might be delayed enough that it captures the state of the GPU when it has managed to finish the job after the job timeout was hit.
Can you try changing the timeout value to something large in drm_sched_init() in etnaviv_sched.c and see if it makes any difference?
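For illustration, the change amounts to bumping the timeout argument passed to drm_sched_init(); the sketch below is reconstructed from memory of the 4.17-era etnaviv_sched.c, so the exact argument list and the etnaviv_hw_jobs_limit/etnaviv_job_hang_limit names should be treated as assumptions rather than verbatim code:

int etnaviv_sched_init(struct etnaviv_gpu *gpu)
{
	int ret;

	/*
	 * The msecs_to_jiffies() argument is the per-job timeout.  Raising
	 * it from 500ms to something large (10s here) shows whether the
	 * hangs are just long-running jobs tripping the scheduler timeout.
	 */
	ret = drm_sched_init(&gpu->sched, &etnaviv_sched_ops,
			     etnaviv_hw_jobs_limit, etnaviv_job_hang_limit,
			     msecs_to_jiffies(10000) /* was 500 */,
			     dev_name(gpu->dev));
	if (ret)
		return ret;

	return 0;
}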
Regards, Lucas
On Tue, Jun 19, 2018 at 12:09:16PM +0200, Lucas Stach wrote:
Hi Russell,
On Tue, Jun 19, 2018 at 10:43 +0100, Russell King - ARM Linux wrote:
It looks like a bug has crept into etnaviv between 4.16 and 4.17, which causes etnaviv to misbehave with the GC600 GPU on Dove. I don't think it's a GPU issue; I think it's a DRM issue.
I get multiple:
[ 596.711482] etnaviv-gpu f1840000.gpu: recover hung GPU!
[ 597.732852] etnaviv-gpu f1840000.gpu: GPU failed to reset: FE not idle, 3D not idle, 2D not idle
while Xorg is starting up. Ignore the "failed to reset", that just seems to be a property of the GC600, and of course is a subsequent issue after the primary problem.
Looking at the devcoredump:
00000004 = 000000fe Idle: FE- DE+ PE+ SH+ PA+ SE+ RA+ TX+ VG- IM- FP- TS-
So, all units on the GC600 were idle except for the front end.
00000660 = 00000812 Cmd: [wait DMA: idle Fetch: valid] Req idle Cal idle
00000664 = 102d06d8 Command DMA address
00000668 = 380000c8 FE fetched word 0
0000066c = 0000001f FE fetched word 1
The front end was basically idle at this point, at a WAIT 200 command. Digging through the ring:
00688: 08010e01 00000040 LDST 0x3804=0x00000040
00690: 40000002 102d06a0 LINK 0x102d06a0
00698: 40000002 102d0690 LINK 0x102d0690
006a0: 08010e04 0000001f LDST 0x3810=0x0000001f
006a8: 40000025 102d3000 LINK 0x102d3000
006b0: 08010e03 00000008 LDST 0x380c=0x00000008 Flush PE2D
006b8: 08010e02 00000701 LDST 0x3808=0x00000701 SEM FE -> PE
006c0: 48000000 00000701 STALL FE -> PE
006c8: 08010e01 00000041 LDST 0x3804=0x00000041
006d0: 380000c8(0000001f) WAIT 200
006d8: 40000002 102d06d0 LINK 0x102d06d0 <===========
We've basically come to the end of the currently issued command stream and hit the wait-link loop. Everything else in the devcoredump looks normal.
So, I think etnaviv DRM has missed an event signalled from the GPU.
I don't see what would make us miss an event suddenly.
This worked fine in 4.16, so it seems to be a regression.
The only thing that comes to mind is that with the DRM scheduler we now enforce a job timeout of 500ms, without the previous logic that allowed a job to run indefinitely as long as it made progress, since that behaviour is a serious QoS issue.
That is probably what's going on then - the GC600 is not particularly fast when dealing with 1080p resolutions.
I think what your commit to use the DRM scheduler is missing is the progress detection from the original scheme - we used to assume that if the GPU FE DMA address had progressed, the GPU was not hung. Now it seems we detect a hang merely by checking for events.
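(Conceptually, that old check amounted to something like the sketch below. This is not verbatim driver code: gpu_read() and VIVS_FE_DMA_ADDRESS are the existing etnaviv helper/register, but the hangcheck_dma_addr field is an assumed name for the stored previous address.)

static bool etnaviv_fe_made_progress(struct etnaviv_gpu *gpu)
{
	u32 dma_addr = gpu_read(gpu, VIVS_FE_DMA_ADDRESS);

	if (dma_addr != gpu->hangcheck_dma_addr) {
		/* The FE has fetched further since the last check:
		 * treat this as progress and postpone hang recovery. */
		gpu->hangcheck_dma_addr = dma_addr;
		return true;
	}

	/* No movement since the last check: assume a real hang. */
	return false;
}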
This might bite you at this point, if Xorg manages to submit a really big job. The coredump might be delayed enough that it captures the state of the GPU when it has managed to finish the job after the job timeout was hit.
No, it's not "a really big job" - it's just that the Dove GC600 is not fast enough to complete _two_ 1080p-sized GPU operations within 500ms. The preceding job contained two blits - one of them a non-alphablend copy of:
00180000 04200780 0,24,1920,1056 -> 0,24,1920,1056
and one an alpha blended copy of:
00000000 04380780 0,0,1920,1080 -> 0,0,1920,1080
This is (iirc) something I already fixed with the addition of the progress detection back before etnaviv was merged into the mainline kernel.
On Tue, Jun 19, 2018 at 12:00 +0100, Russell King - ARM Linux wrote:
On Tue, Jun 19, 2018 at 12:09:16PM +0200, Lucas Stach wrote:
Hi Russell,
On Tue, Jun 19, 2018 at 10:43 +0100, Russell King - ARM Linux wrote:
It looks like a bug has crept into etnaviv between 4.16 and 4.17, which causes etnaviv to misbehave with the GC600 GPU on Dove. I don't think it's a GPU issue; I think it's a DRM issue.
I get multiple:
[ 596.711482] etnaviv-gpu f1840000.gpu: recover hung GPU!
[ 597.732852] etnaviv-gpu f1840000.gpu: GPU failed to reset: FE not idle, 3D not idle, 2D not idle
while Xorg is starting up. Ignore the "failed to reset", that just seems to be a property of the GC600, and of course is a subsequent issue after the primary problem.
Looking at the devcoredump:
00000004 = 000000fe Idle: FE- DE+ PE+ SH+ PA+ SE+ RA+ TX+ VG- IM- FP- TS-
So, all units on the GC600 were idle except for the front end.
00000660 = 00000812 Cmd: [wait DMA: idle Fetch: valid] Req idle Cal idle
00000664 = 102d06d8 Command DMA address
00000668 = 380000c8 FE fetched word 0
0000066c = 0000001f FE fetched word 1
The front end was basically idle at this point, at a WAIT 200 command. Digging through the ring:
00688: 08010e01 00000040 LDST 0x3804=0x00000040
00690: 40000002 102d06a0 LINK 0x102d06a0
00698: 40000002 102d0690 LINK 0x102d0690
006a0: 08010e04 0000001f LDST 0x3810=0x0000001f
006a8: 40000025 102d3000 LINK 0x102d3000
006b0: 08010e03 00000008 LDST 0x380c=0x00000008 Flush PE2D
006b8: 08010e02 00000701 LDST 0x3808=0x00000701 SEM FE -> PE
006c0: 48000000 00000701 STALL FE -> PE
006c8: 08010e01 00000041 LDST 0x3804=0x00000041
006d0: 380000c8(0000001f) WAIT 200
006d8: 40000002 102d06d0 LINK 0x102d06d0 <===========
We've basically come to the end of the currently issued command stream and hit the wait-link loop. Everything else in the devcoredump looks normal.
So, I think etnaviv DRM has missed an event signalled from the GPU.
I don't see what would make us miss an event suddenly.
This worked fine in 4.16, so it seems to be a regression.
The only thing that comes to mind is that with the DRM scheduler we now enforce a job timeout of 500ms, without the previous logic that allowed a job to run indefinitely as long as it made progress, since that behaviour is a serious QoS issue.
That is probably what's going on then - the GC600 is not particularly fast when dealing with 1080p resolutions.
I think what your commit to use the DRM scheduler is missing is the progress detection from the original scheme - we used to assume that if the GPU FE DMA address had progressed, the GPU was not hung. Now it seems we detect a hang merely by checking for events.
It was a deliberate decision to remove this, as it's a potential DoS vector where a rogue client is able to basically starve all other clients of GPU access by submitting a job that runs for a very long time, as long as it's making some progress.
This might bite you at this point, if Xorg manages to submit a really big job. The coredump might be delayed enough that it captures the state of the GPU when it has managed to finish the job after the job timeout was hit.
No, it's not "a really big job" - it's just that the Dove GC600 is not fast enough to complete _two_ 1080p-sized GPU operations within 500ms. The preceding job contained two blits - one of them a non-alphablend copy of:
00180000 04200780 0,24,1920,1056 -> 0,24,1920,1056
and one an alpha blended copy of:
00000000 04380780 0,0,1920,1080 -> 0,0,1920,1080
This is (iirc) something I already fixed with the addition of the progress detection back before etnaviv was merged into the mainline kernel.
I hadn't expected it to be this slow. I see that we might need to bring back the progress detection to fix the userspace regression, but I'm not fond of this, as it might lead to really bad QoS.
I would prefer userspace tracking the size of the blits and flushing the cmdstream at an appropriate time, so we don't end up with really long running jobs, but I'm not sure if this would be acceptable to you...
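(Purely as an illustration of that idea, a userspace heuristic could keep a running pixel estimate of the queued blits and flush the command stream once a budget is exceeded, roughly as sketched below; every name here is invented.)

#include <stdint.h>

/* All names hypothetical: flush once queued blits exceed a pixel budget so a
 * single submit never runs for hundreds of milliseconds on a slow GPU. */
struct blit_queue {
	uint64_t pending_pixels;
	/* ...handle to the real command stream... */
};

void emit_blit(struct blit_queue *q, unsigned int w, unsigned int h);	/* hypothetical */
void flush_cmdstream(struct blit_queue *q);				/* hypothetical */

static void queue_blit_and_maybe_flush(struct blit_queue *q,
				       unsigned int w, unsigned int h)
{
	const uint64_t budget = 4ULL * 1920 * 1080;	/* tune per GPU */

	emit_blit(q, w, h);
	q->pending_pixels += (uint64_t)w * h;

	if (q->pending_pixels >= budget) {
		flush_cmdstream(q);
		q->pending_pixels = 0;
	}
}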
Regards, Lucas
On Tue, Jun 19, 2018 at 01:11:29PM +0200, Lucas Stach wrote:
On Tue, Jun 19, 2018 at 12:00 +0100, Russell King - ARM Linux wrote:
No, it's not "a really big job" - it's just that the Dove GC600 is not fast enough to complete _two_ 1080p-sized GPU operations within 500ms. The preceding job contained two blits - one of them a non-alphablend copy of:
00180000 04200780 0,24,1920,1056 -> 0,24,1920,1056
and one an alpha blended copy of:
00000000 04380780 0,0,1920,1080 -> 0,0,1920,1080
This is (iirc) something I already fixed with the addition of the progress detection back before etnaviv was merged into the mainline kernel.
I hadn't expected it to be this slow. I see that we might need to bring back the progress detection to fix the userspace regression, but I'm not fond of this, as it might lead to really bad QoS.
Well, the choices are that or worse overall performance through having to ignore the GPU entirely.
I would prefer userspace tracking the size of the blits and flushing the cmdstream at an appropriate time, so we don't end up with really long running jobs, but I'm not sure if this would be acceptable to you...
The question becomes how to split up two operations. Yes, we could submit them individually, but if they're together taking in excess of 500ms, then it's likely that individually, each operation will take in excess of 250ms which is still a long time.
In any case, I think we need to fix this for 4.17-stable and then try to work out (a) which operations are taking a long time, and (b) how to solve this issue.
Do we have any way to track how long each submitted job has actually taken on the GPU? (Eg, by recording the times that we receive the events?) It wouldn't be very accurate for small jobs, but given this operation is taking so long, it would give an indication of how long this operation is actually taking. etnaviv doesn't appear to have any tracepoints, which would've been ideal for that. Maybe this is a reason to add some? ;)
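(As a sketch of the "record the time we receive the event" idea: stamp the submit when it is queued to the front end and report the delta when its completion event arrives. The exec_start field and both function names are assumptions for illustration, not existing etnaviv code.)

static void etnaviv_stamp_submit(struct etnaviv_gem_submit *submit)
{
	/* Call this where the job is actually queued to the FE. */
	submit->exec_start = ktime_get();
}

static void etnaviv_report_submit_time(struct etnaviv_gpu *gpu,
				       struct etnaviv_gem_submit *submit)
{
	/* Call this from the completion/event handling path. */
	s64 ms = ktime_ms_delta(ktime_get(), submit->exec_start);

	dev_dbg(gpu->dev, "submit spent %lld ms on the GPU\n", ms);
}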
On Tue, Jun 19, 2018 at 12:42 +0100, Russell King - ARM Linux wrote:
On Tue, Jun 19, 2018 at 01:11:29PM +0200, Lucas Stach wrote:
On Tue, Jun 19, 2018 at 12:00 +0100, Russell King - ARM Linux wrote:
No, it's not "a really big job" - it's just that the Dove GC600 is not fast enough to complete _two_ 1080p-sized GPU operations within 500ms. The preceding job contained two blits - one of them a non-alphablend copy of:
00180000 04200780 0,24,1920,1056 -> 0,24,1920,1056
and one an alpha blended copy of:
00000000 04380780 0,0,1920,1080 -> 0,0,1920,1080
This is (iirc) something I already fixed with the addition of the progress detection back before etnaviv was merged into the mainline kernel.
I hadn't expected it to be this slow. I see that we might need to bring back the progress detection to fix the userspace regression, but I'm not fond of this, as it might lead to really bad QoS.
Well, the choices are that or worse overall performance through having to ignore the GPU entirely.
I would prefer userspace tracking the size of the blits and flushing the cmdstream at an appropriate time, so we don't end up with really long running jobs, but I'm not sure if this would be acceptable to you...
The question becomes how to split up two operations. Yes, we could submit them individually, but if they're together taking in excess of 500ms, then it's likely that individually, each operation will take in excess of 250ms which is still a long time.
In any case, I think we need to fix this for 4.17-stable and then try to work out (a) which operations are taking a long time, and (b) how to solve this issue.
Agreed. I'll look into bringing back the progress detection for 4.17 stable.
I'm still curious why the GC600 on the Dove is that slow. With performance like this, moving a big(ish) window on the screen must be a horrible user experience.
Do we have any way to track how long each submitted job has actually taken on the GPU? (Eg, by recording the times that we receive the events?) It wouldn't be very accurate for small jobs, but given this operation is taking so long, it would give an indication of how long this operation is actually taking. etnaviv doesn't appear to have any tracepoints, which would've been ideal for that. Maybe this is a reason to add some? ;)
See attached patch (which I apparently forgot to send out). The DRM GPU scheduler has some tracepoints, which might be helpful. The attached patch adds a drm_sched_job_run tracepoint when a job is queued in the hardware ring. Together with the existing drm_sched_process_job, this should get you an idea how long a job takes to process. Note that at any time up to 4 jobs are allowed in the hardware queue, so you need to match up the end times.
Regards, Lucas
On Tue, Jun 19, 2018 at 02:28:46PM +0200, Lucas Stach wrote:
On Tue, Jun 19, 2018 at 12:42 +0100, Russell King - ARM Linux wrote:
On Tue, Jun 19, 2018 at 01:11:29PM +0200, Lucas Stach wrote:
On Tue, Jun 19, 2018 at 12:00 +0100, Russell King - ARM Linux wrote:
No, it's not "a really big job" - it's just that the Dove GC600 is not fast enough to complete _two_ 1080p-sized GPU operations within 500ms. The preceding job contained two blits - one of them a non-alphablend copy of:
00180000 04200780 0,24,1920,1056 -> 0,24,1920,1056
and one an alpha blended copy of:
00000000 04380780 0,0,1920,1080 -> 0,0,1920,1080
This is (iirc) something I already fixed with the addition of the progress detection back before etnaviv was merged into the mainline kernel.
I hadn't expected it to be this slow. I see that we might need to bring back the progress detection to fix the userspace regression, but I'm not fond of this, as it might lead to really bad QoS.
Well, the choices are that or worse overall performance through having to ignore the GPU entirely.
I would prefer userspace tracking the size of the blits and flushing the cmdstream at an appropriate time, so we don't end up with really long running jobs, but I'm not sure if this would be acceptable to you...
The question becomes how to split up two operations. Yes, we could submit them individually, but if they're together taking in excess of 500ms, then it's likely that individually, each operation will take in excess of 250ms which is still a long time.
In any case, I think we need to fix this for 4.17-stable and then try to work out (a) which operations are taking a long time, and (b) how to solve this issue.
Agreed. I'll look into bringing back the progress detection for 4.17 stable.
I'm still curious why the GC600 on the Dove is that slow. With performance like this, moving a big(ish) window on the screen must be a horrible user experience.
I _think_ it's down to the blend being slow on GC600 - one of the problems of running modern "desktops" on the Dove is that with Xorg and a compositing window manager (eg, modern metacity) then yes, dragging windows around is very slow because of the multiple GPU operations required - even dragging a small window results in almost the entire screen being re-blended.
I don't think that's fair to blame on the Dove though - that's just total inefficiency on the Xorg/compositing side to basically redraw the _entire_ screen for small changes.
The compositing window manager brings other issues with it as well, in particular with the colour-keyed overlay and detecting whether anything obscures the overlay. For example, if, as a memory bandwidth optimisation, you detect in the Xvideo extension that the overlay window is unobscured and disable the primary plane and colourkeying, this works fine with non-compositing managers. However, with a compositing manager, you can end up with graphics blended _on top_ of the Xvideo window that the Xvideo backend knows nothing about... which results in those graphics not being displayed.
The blending also has a detrimental effect on the colourkeying when the graphics is displayed - because of the blending, the colourkey is no longer the expected RGB value around objects, so you get the colourkey bleeding through around (eg) a menu.
I've now disabled compositing in metacity which makes things a whole lot nicer (I've actually been meaning to track down the "slow window dragging" problem for a good few months now) and solves the overlay issue too.
Do we have any way to track how long each submitted job has actually taken on the GPU? (Eg, by recording the times that we receive the events?) It wouldn't be very accurate for small jobs, but given this operation is taking so long, it would give an indication of how long this operation is actually taking. etnaviv doesn't appear to have any tracepoints, which would've been ideal for that. Maybe this is a reason to add some? ;)
See attached patch (which I apparently forgot to send out). The DRM GPU scheduler has some tracepoints, which might be helpful. The attached patch adds a drm_sched_job_run tracepoint when a job is queued in the hardware ring. Together with the existing drm_sched_process_job, this should get you an idea how long a job takes to process. Note that at any time up to 4 jobs are allowed in the hardware queue, so you need to match up the end times.
Thanks, I'll try to get some data in the next week or so.
On Tue, Jun 19, 2018 at 02:28:46PM +0200, Lucas Stach wrote:
Agreed. I'll look into bringing back the progress detection for 4.17 stable.
I'm still curious why the GC600 on the Dove is that slow. With performance like this, moving a big(ish) window on the screen must be a horrible user experience.
This doesn't seem to be the cause - it seems that there's something going on with 4.17 that really is causing the Dove GC600 to get stuck. Reverting all the drivers/gpu/drm/etnaviv changes back to the 4.16 state while keeping everything else the same results in no hangs, whereas increasing the timeout with 4.17 still results in hangs.
I'll try to find some time to bisect.
On Tue, Jun 26, 2018 at 09:17:26AM +0100, Russell King - ARM Linux wrote:
On Tue, Jun 19, 2018 at 02:28:46PM +0200, Lucas Stach wrote:
Agreed. I'll look into bringing back the progress detection for 4.17 stable.
I'm still curious why the GC600 on the Dove is that slow. With performance like this, moving a big(ish) window on the screen must be a horrible user experience.
This doesn't seem to be the cause - it seems that there's something going on with 4.17 that really is causing the Dove GC600 to get stuck. Reverting all the drivers/gpu/drm/etnaviv changes back to the 4.16 state while keeping everything else the same results in no hangs, whereas increasing the timeout with 4.17 still results in hangs.
I'll try to find some time to bisect.
Sorry, it seems that my attempts to change what was running on the system were ineffective (the etnaviv module was being loaded from the initramfs, not from the fs copy I was updating). Extending the timeout to 5 seconds does indeed stop the issue.
More importantly, it stops some memory corruption I've observed as well, caused by etnaviv freeing buffers when it thinks the GPU has timed out while the GPU is still writing to them.
Hi Russell,
On Tue, Jun 26, 2018 at 16:36 +0100, Russell King - ARM Linux wrote:
On Tue, Jun 26, 2018 at 09:17:26AM +0100, Russell King - ARM Linux wrote:
On Tue, Jun 19, 2018 at 02:28:46PM +0200, Lucas Stach wrote:
Agreed. I'll look into bringing back the progress detection for 4.17 stable.
I'm still curious why the GC600 on the Dove is that slow. With performance like this, moving a big(ish) window on the screen must be a horrible user experience.
This doesn't seem to be the cause - it seems that there's something going on with 4.17 that really is causing the Dove GC600 to get stuck. Reverting all the drivers/gpu/drm/etnaviv changes back to the 4.16 state while keeping everything else the same results in no hangs, whereas increasing the timeout with 4.17 still results in hangs.
I'll try to find some time to bisect.
Sorry, it seems that my attempts to change what was running on the system were ineffective (the etnaviv module was being loaded from the initramfs, not from the fs copy I was updating). Extending the timeout to 5 seconds does indeed stop the issue.
Thanks for confirming that the issue is caused by the removed progress check.
More importantly, it stops some memory corruption I've observed as well, caused by etnaviv freeing buffers when it thinks the GPU has timed out while the GPU is still writing to them.
Urgh, that is really bad. I'll get a fix out of the door tomorrow.
Regards, Lucas
Hi Russell,
On Tue, Jun 26, 2018 at 17:48 +0200, Lucas Stach wrote:
Hi Russell,
On Tue, Jun 26, 2018 at 16:36 +0100, Russell King - ARM Linux wrote:
On Tue, Jun 26, 2018 at 09:17:26AM +0100, Russell King - ARM Linux wrote:
On Tue, Jun 19, 2018 at 02:28:46PM +0200, Lucas Stach wrote:
Agreed. I'll look into bringing back the progress detection for 4.17 stable.
I'm still curious why the GC600 on the Dove is that slow. With performance like this, moving a big(ish) window on the screen must be a horrible user experience.
This doesn't seem to be the cause - it seems that there's something going on with 4.17 that really is causing the Dove GC600 to get stuck. Reverting all the drivers/gpu/drm/etnaviv changes back to the 4.16 state while keeping everything else the same results in no hangs, whereas increasing the timeout with 4.17 still results in hangs.
I'll try to find some time to bisect.
Sorry, it seems that my attempts to change what was running on the system were ineffective (the etnaviv module was being loaded from the initramfs, not from the fs copy I was updating). Extending the timeout to 5 seconds does indeed stop the issue.
Thanks for confirming that the issue is caused by the removed progress check.
More importantly, it stops some memory corruption I've observed as well, caused by etnaviv freeing buffers when it thinks the GPU has timed out while the GPU is still writing to them.
Urgh, that is really bad. I'll get a fix out of the door tomorrow.
I've just sent a patch to do this. Can you please confirm that it fixes your issue on GC600?
Thanks, Lucas
On Tue, Jun 19, 2018 at 02:28:46PM +0200, Lucas Stach wrote:
See attached patch (which I apparently forgot to send out). The DRM GPU scheduler has some tracepoints, which might be helpful. The attached patch adds a drm_sched_job_run tracepoint when a job is queued in the hardware ring. Together with the existing drm_sched_process_job, this should get you an idea how long a job takes to process. Note that at any time up to 4 jobs are allowed in the hardware queue, so you need to match up the end times.
Here's the trace:
Xorg 1009 [000] 276.593018: gpu_scheduler:drm_sched_job: entity=0xee9c3e00, id=144, fence=0xeb849580, ring=ef16bd00, job count:0, hw job count:0
gnome-software 1384 [000] 276.673634: gpu_scheduler:drm_sched_process_job: fence=0xeb849580 signaled
Xorg 1009 [000] 278.756538: gpu_scheduler:drm_sched_job: entity=0xee9c3e00, id=145, fence=0xecaaef40, ring=ef16bd00, job count:0, hw job count:0
Xorg 1009 [000] 279.130196: gpu_scheduler:drm_sched_process_job: fence=0xecaaef40 signaled
Xorg 1009 [000] 279.856129: gpu_scheduler:drm_sched_job: entity=0xee9c3e00, id=146, fence=0xecbdd7c0, ring=ef16bd00, job count:0, hw job count:0
Xorg 1009 [000] 279.856542: gpu_scheduler:drm_sched_job: entity=0xee9c3e00, id=147, fence=0xeb849700, ring=ef16bd00, job count:0, hw job count:1
kswapd0 26 [000] 281.532808: gpu_scheduler:drm_sched_process_job: fence=0xecbdd7c0 signaled
gnome-software 1384 [000] 281.560666: gpu_scheduler:drm_sched_process_job: fence=0xeb849700 signaled
Xorg 1009 [000] 282.554920: gpu_scheduler:drm_sched_job: entity=0xee9c3e00, id=148, fence=0xecad3a00, ring=ef16bd00, job count:0, hw job count:0
Xorg 1009 [000] 283.956216: gpu_scheduler:drm_sched_job: entity=0xee9c3e00, id=149, fence=0xeb849700, ring=ef16bd00, job count:0, hw job count:1
Xorg 1009 [000] 284.149557: gpu_scheduler:drm_sched_process_job: fence=0xecad3a00 signaled
Xorg 1009 [000] 285.465863: gpu_scheduler:drm_sched_process_job: fence=0xeb849700 signaled
Xorg 1009 [000] 286.796692: gpu_scheduler:drm_sched_job: entity=0xee9c3e00, id=150, fence=0xecbdd340, ring=ef16bd00, job count:0, hw job count:0
Xorg 1009 [000] 288.549507: gpu_scheduler:drm_sched_job: entity=0xee9c3e00, id=151, fence=0xecbdd280, ring=ef16bd00, job count:0, hw job count:1
Xorg 1009 [000] 288.856277: gpu_scheduler:drm_sched_job: entity=0xee9c3e00, id=152, fence=0xecaaee80, ring=ef16bd00, job count:0, hw job count:2
Xorg 1009 [000] 288.856713: gpu_scheduler:drm_sched_job: entity=0xee9c3e00, id=153, fence=0xecad3040, ring=ef16bd00, job count:0, hw job count:3
gnome-software 1384 [000] 288.958764: gpu_scheduler:drm_sched_process_job: fence=0xecbdd340 signaled
Xorg 1009 [000] 290.052842: gpu_scheduler:drm_sched_job: entity=0xee9c3e00, id=154, fence=0xecbdde80, ring=ef16bd00, job count:0, hw job count:3
Xorg 1009 [000] 291.682222: gpu_scheduler:drm_sched_job: entity=0xee9c3e00, id=155, fence=0xecaaeb80, ring=ef16bd00, job count:0, hw job count:4
gnome-software 1384 [000] 291.760295: gpu_scheduler:drm_sched_process_job: fence=0xecbdd280 signaled
gnome-software 1384 [000] 292.018437: gpu_scheduler:drm_sched_process_job: fence=0xecaaee80 signaled
Xorg 1009 [000] 292.046721: gpu_scheduler:drm_sched_process_job: fence=0xecad3040 signaled
Xorg 1009 [000] 292.819475: gpu_scheduler:drm_sched_process_job: fence=0xecbdde80 signaled
Xorg 1009 [000] 293.149829: gpu_scheduler:drm_sched_job: entity=0xee9c3e00, id=156, fence=0xecad3f40, ring=ef16bd00, job count:0, hw job count:1
gnome-software 1384 [000] 293.190607: gpu_scheduler:drm_sched_process_job: fence=0xecaaeb80 signaled
gnome-software 1384 [000] 293.190720: gpu_scheduler:drm_sched_process_job: fence=0xecad3f40 signaled
Xorg 1009 [000] 293.463414: gpu_scheduler:drm_sched_job: entity=0xee9c3e00, id=157, fence=0xecbddd00, ring=ef16bd00, job count:0, hw job count:0
Xorg 1009 [000] 293.463629: gpu_scheduler:drm_sched_process_job: fence=0xecbddd00 signaled
So:
Job  Start       End         Time on queue
144  276.593018  276.673634   80ms
145  278.756538  279.130196  374ms
146  279.856129  281.532808  1.7s
147  279.856542  281.560666  1.7s  (overlaps 146)
148  282.554920  284.149557  1.6s
149  283.956216  285.465863  1.5s  (overlaps 148)
150  286.796692  288.958764  2.2s
151  288.549507  291.760295  3.2s  (overlaps 150)
152  288.856277  292.018437  3.2s  (overlaps 150-151)
153  288.856713  292.046721  3.2s  (overlaps 150-152)
154  290.052842  292.819475  2.8s  (overlaps 151-153)
155  291.682222  293.190607  1.5s  (overlaps 151-154)
156  293.149829  293.190720   40ms (overlaps 155)
157  293.463414  293.463629  215us
The times for jobs 147, 149 and 151-155 are misleading because they are delayed by the preceding job. Correcting for that:
Time executing:
147   28ms
149  1.3s
151  2.8s
152  250ms
153   28ms
154  780ms
155  370ms
So, sorted by time apparently executing:
151                           2.8s
150  286.796692  288.958764   2.2s
146  279.856129  281.532808   1.7s
148  282.554920  284.149557   1.6s
149                           1.3s
154                          780ms
145  278.756538  279.130196  374ms
155                          370ms
152                          250ms
144  276.593018  276.673634   80ms
156  293.149829  293.190720   40ms
153                           28ms
147                           28ms
157  293.463414  293.463629  215us
Of course, from the perf dump, we don't know what the GPU was being asked to do.
I've applied your forward-progress patch, and with the 5s timeout reverted, it seems to "behave" without causing a timeout.