I'm having an issue where a long-running test eventually runs into an MMU fault. What this test does is basically:
- while [ 1 ]; do start a program that:
  - Allocate bo A, B, C and D
  - Map bo C, update it
  - Loop:
    - Map bo A, B and C, update them
    - Build command buffer
    - Submit command buffer
    - etna_cmd_stream_finish
    - Map buffer A, check output
  - Delete buffer A, B, C and D
  - Exit program

(code is here: https://github.com/etnaviv/etnaviv_gpu_tests/blob/master/src/etnaviv_verifyo...)
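In terms of the etnaviv drm interface (etnaviv_drmif.h), a stripped-down sketch of one run looks roughly like this - the device path, buffer sizes and the output check are placeholders, and the actual GPU commands are left out, so this is not the real test code:

/* Rough outline of one run of the test; not the actual code from the
 * repository linked above. Device path, sizes and the "check output" step
 * are placeholders, and the GPU commands themselves are omitted. */
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <etnaviv_drmif.h>

int main(void)
{
	int fd = open("/dev/dri/renderD128", O_RDWR); /* placeholder path */
	struct etna_device *dev = etna_device_new(fd);
	struct etna_gpu *gpu = etna_gpu_new(dev, 0);
	struct etna_pipe *pipe = etna_pipe_new(gpu, ETNA_PIPE_3D);
	struct etna_cmd_stream *stream = etna_cmd_stream_new(pipe, 0x1000, NULL, NULL);

	/* Allocate bo A, B, C and D */
	struct etna_bo *a = etna_bo_new(dev, 0x10000, ETNA_BO_UNCACHED);
	struct etna_bo *b = etna_bo_new(dev, 0x10000, ETNA_BO_UNCACHED);
	struct etna_bo *c = etna_bo_new(dev, 0x10000, ETNA_BO_UNCACHED);
	struct etna_bo *d = etna_bo_new(dev, 0x10000, ETNA_BO_UNCACHED);

	/* Map bo C, update it */
	memset(etna_bo_map(c), 0, 0x10000);

	for (int i = 0; i < 1000; ++i) {
		/* Map bo A, B and C, update them */
		memset(etna_bo_map(a), 0, 0x10000);
		memset(etna_bo_map(b), i & 0xff, 0x10000);
		memset(etna_bo_map(c), i & 0xff, 0x10000);

		/* Build a command buffer referencing the bos (GPU commands
		 * omitted here), submit it and wait for completion. */
		etna_cmd_stream_finish(stream);

		/* Map buffer A, check output (placeholder check) */
		assert(etna_bo_map(a) != NULL);
	}

	/* Delete buffer A, B, C and D, then exit */
	etna_bo_del(a);
	etna_bo_del(b);
	etna_bo_del(c);
	etna_bo_del(d);
	return 0;
}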
The curious thing is that after the fault happens once, it keeps running into the same fault almost immediately, even after a GPU reset. This made me suspect it has to do with kernel driver state, not GPU state.
I added some debugging in the kernel driver in etnaviv_iommu_find_iova:
<4>[ 549.776209] Found iova: 00000000 eff82000
<4>[ 549.780712] Found iova: 00000000 eff93000
<4>[ 549.785173] Found iova: 00000000 effa4000
<4>[ 549.789706] Found iova: 00000000 effb5000
<4>[ 549.794167] Found iova: 00000000 effc6000
<4>[ 549.798686] Found iova: 00000000 effd7000
<4>[ 549.803171] Found iova: 00000000 effe8000
<4>[ 549.803171] Found iova: 00000000 effe8000
<4>[ 549.807680] last_iova <- end of range
<4>[ 549.809966] Found iova: 00000000 e8783000
<3>[ 549.814025] etnaviv-gpu 130000.gpu: MMU fault status 0x00000002   <- happens almost immediately
<3>[ 549.819960] etnaviv-gpu 130000.gpu: MMU 0 fault addr 0xe8783040
<3>[ 549.825889] etnaviv-gpu 130000.gpu: MMU 1 fault addr 0x00000000
<3>[ 549.831817] etnaviv-gpu 130000.gpu: MMU 2 fault addr 0x00000000
<3>[ 549.837744] etnaviv-gpu 130000.gpu: MMU 3 fault addr 0x00000000
Apparently it is running out of the address space. (I changed the end of the range to 0xf0000000 instead of 0xffffffff to rule out that it had to do with the GPU disliking certain addresses)
In principle this shouldn't be an issue - after last_iova it starts over, with a flushed MMU. I verified that this flush is actually being queued in etnaviv_buffer_queue.
However, for some reason that logic doesn't seem to be working. I have not found out what is wrong yet. I have not verified whether the MMU flush is actually flushing, or whether this is a problem with updating the page tables.
What I find curious, though, is that after the search presumably starts over at 0 it returns 0xe8783000 instead of an earlier address. For this reason last_iova is stuck near the end of the address space and the problem keeps repeating once it's been hit.
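For reference, the wraparound behaviour I'd expect is roughly the following (a standalone toy model with made-up helpers, not the actual etnaviv_mmu.c code):

/* Toy model of the intended iova wraparound: hand out addresses going up;
 * when the remaining space above last_iova is too small, restart at 0 and
 * mark the MMU as needing a flush (in the real driver that flush is what
 * gets queued in etnaviv_buffer_queue). Helper names are made up. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define RANGE_END 0xf0000000u	/* matches the reduced end of range above */

struct mmu_model {
	uint32_t last_iova;
	bool need_flush;
};

/* Stand-in for the real range allocator. */
static bool try_alloc(uint32_t start, uint32_t end, uint32_t size, uint32_t *iova)
{
	if (end - start < size)
		return false;
	*iova = start;
	return true;
}

static uint32_t find_iova(struct mmu_model *mmu, uint32_t size)
{
	uint32_t iova;

	/* First try above last_iova, where no flush is needed. */
	if (try_alloc(mmu->last_iova, RANGE_END, size, &iova)) {
		mmu->last_iova = iova + size;
		return iova;
	}

	/* Out of space at the top: start over at 0 and require an MMU
	 * flush before the next submit, because stale TLB entries may
	 * still point at old mappings below last_iova. */
	mmu->need_flush = true;
	mmu->last_iova = 0;
	if (!try_alloc(0, RANGE_END, size, &iova))
		return 0;
	mmu->last_iova = iova + size;
	return iova;
}

int main(void)
{
	/* Numbers echo the log above: ~0x11000 per object, near the top. */
	struct mmu_model mmu = { .last_iova = 0xefff9000u };
	uint32_t iova = find_iova(&mmu, 0x11000);

	printf("iova %08x, need_flush %d\n", (unsigned)iova, mmu.need_flush);
	return 0;
}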
It's certainly possible that I'm doing something dumb here and am somehow spamming the address space full :)
Wladimir
Okay, I just tried to reproduce the same thing while rendering in Mesa, and it doesn't happen.
It reaches the end of the address space, sets last_iova back to 0, and just continues.
So the MMU fault is somehow specific to what I'm doing. Interesting.
This does happen when rendering - it keeps dealing out iovas near the end of the address space. But that seems harmless, though it may cause some more MMU flushes than necessary.
Wladimir
> So the MMU fault is somehow specific to what I'm doing. Interesting.
I think I found the issue: the MMU "flush and sync" is not good enough in some cases.
What the Vivante kernel driver does, for MMUv2, after mapping some kinds of buffer objects (apparently those tagged INDEX and VERTEX, which includes shader code and CL buffers), is the following:
- Send MMU flush command (like we do)
- Add a notify event "resume" (they hardwire event 29 for this)
- Add an END command to the command buffer so that the FE stops
- Remember where to continue
Then in the interrupt handler:
- If the "resume" notify event comes in:
  - Wait for the FE to be idle
  - Restart the FE at the remembered position
This is implemented in "pause" here:
http://git.freescale.com/git/cgit.cgi/imx/linux-2.6-imx.git/tree/drivers/mxc...
gcvPAGE_TABLE_DIRTY_BIT_FE is set here:
http://git.freescale.com/git/cgit.cgi/imx/linux-2.6-imx.git/tree/drivers/mxc...
endAfterFlushMmuCache is set here:
http://git.freescale.com/git/cgit.cgi/imx/linux-2.6-imx.git/tree/drivers/mxc...
The interrupt notification is handled here:
http://git.freescale.com/git/cgit.cgi/imx/linux-2.6-imx.git/tree/drivers/mxc...
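Schematically the sequence looks like this (a standalone sketch: the emit_*/fe_* helpers are made-up stand-ins for the real command emission and FE control, only the ordering is the point):

/* Sketch of the "pause and resume after MMU flush" sequence described
 * above; all helper names are invented for illustration. */
#include <stdint.h>
#include <stdio.h>

#define EVENT_MMU_RESUME 29	/* Vivante hardwires event 29 for this */

static void emit_mmu_flush(void)       { puts("emit: MMU flush command"); }
static void emit_event(unsigned id)    { printf("emit: notify event %u\n", id); }
static void emit_end(void)             { puts("emit: END (FE stops here)"); }
static void fe_wait_idle(void)         { puts("irq: wait for FE to be idle"); }
static void fe_start(uint32_t address) { printf("irq: restart FE at 0x%08x\n", (unsigned)address); }

static uint32_t resume_address;

/* Kernel side, when queueing a submit whose new mappings dirtied the MMU. */
static void queue_with_mmu_flush(uint32_t continue_address)
{
	emit_mmu_flush();
	emit_event(EVENT_MMU_RESUME);
	emit_end();				/* the FE halts after the flush */
	resume_address = continue_address;	/* remember where to continue */
}

/* Interrupt handler, when the "resume" notify event comes in. */
static void irq_handler(unsigned event)
{
	if (event == EVENT_MMU_RESUME) {
		fe_wait_idle();
		fe_start(resume_address);	/* FE refetches with a clean TLB */
	}
}

int main(void)
{
	queue_with_mmu_flush(0x10001000);	/* arbitrary example address */
	irq_handler(EVENT_MMU_RESUME);
	return 0;
}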
I hacked this into the DRM driver and have been running my test for quite some time, bumping against the tail end of the address range many times, without any MMU faults.
My proposal is to add a bo flag for buffers that need this kind of "hard" MMU reset (this is not all of them, e.g. textures don't), and if their iova mapping requires an MMU flush, do the above stop-and-start ritual (in the case of MMUv2).
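Purely as an illustration of what I mean (the flag name and value are made up, nothing like this exists in the uapi), usage could look something like:

#include <stdint.h>
#include <etnaviv_drmif.h>

/* Hypothetical flag: marks bos whose mapping must be followed by the
 * "hard" stop-and-start MMU flush. Name and value are invented. */
#define ETNA_BO_MMU_SYNC 0x00100000

struct etna_bo *alloc_shader_bo(struct etna_device *dev, uint32_t size)
{
	/* Shader code / CL buffers would get the new flag... */
	return etna_bo_new(dev, size, ETNA_BO_UNCACHED | ETNA_BO_MMU_SYNC);
}

struct etna_bo *alloc_texture_bo(struct etna_device *dev, uint32_t size)
{
	/* ...while textures and the like keep the normal path. */
	return etna_bo_new(dev, size, ETNA_BO_WC);
}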
Wladimir
Hi Wladimir,
On Saturday, 10.12.2016, 18:05 +0100, Wladimir J. van der Laan wrote:
I'm aware of what the Vivante driver does. Unfortunately we would basically need to flush the MMU before each user command stream, as we continuously map new command buffers into the IOVA space, which would be crippling for performance. Vivante gets around this by setting up a 1:1 virt:phys mapping by default.
The current etnaviv code gets around this stop->irq->start dance by spacing out the command streams, which seems to be enough to get around the FE MMU flush failure. This may not work correctly at the end of the address range. I'll take a look at this.
Blindly implementing the Vivante way does not seem like the correct approach to me.
Regards, Lucas
In my case this doesn't seem to be happening for a command buffer, but for another bo used by the command buffer.
> Blindly implementing the Vivante way does not seem like the correct approach to me.
I'm not suggesting that this is a good solution! I just needed to do it to narrow down the issue, as well as to get rid of it for now.
Regards, Wladimir