https://bugs.freedesktop.org/show_bug.cgi?id=107545
Bug ID: 107545
Summary: radeon - ring 0 stalled - GPU lockup - SI
Product: DRI
Version: XOrg git
Hardware: Other
OS: All
Status: NEW
Severity: normal
Priority: medium
Component: DRM/Radeon
Assignee: dri-devel@lists.freedesktop.org
Reporter: julien.isorce@gmail.com
* Steps to reproduce: for i in {0..300}; do (glxgears &); done (note that 100 might be enough instead of 300)
* Actual result: ring 0 stalled, gpu lockup, reset, then x11 stops and cannot be restarted. The only way to recover is to reboot.
* Expected result: The fps drops as more glxgears instances are started, but there is no gpu lockup, like with an intel gpu.
* Info: card W600, mesa 18.2, kernel 4.15.0-15-generic, LLVM 7.0.0, xorg 1.20.99.1, xf86-video-ati 18.0.1. (same result with kernel 4.4.0-130, mesa 12.0.6, llvm 3.8.0, DRM 2.43.0)
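The reproduction loop above can also be driven from a script, which makes it easier to vary the instance count and to clean up afterwards. The following is a minimal Python sketch, not part of the original report; the helper names `spawn_instances` and `stop_all` are hypothetical, and it assumes the program to launch (for example `glxgears`) is on the PATH.

```python
import subprocess


def spawn_instances(command, count):
    """Start `count` copies of `command` in the background and return them."""
    return [subprocess.Popen(command) for _ in range(count)]


def stop_all(processes, timeout=10):
    """Ask every still-running process to terminate, then collect exit codes."""
    for proc in processes:
        if proc.poll() is None:  # process has not exited yet
            proc.terminate()
    return [proc.wait(timeout=timeout) for proc in processes]
```

For example, `procs = spawn_instances(["glxgears"], 100)` starts 100 instances, and `stop_all(procs)` terminates them once the test is over.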
I was playing with the apitrace here https://bugs.freedesktop.org/show_bug.cgi?id=87278#c31 and decided to throw dozens of glxgears instances at it to see what happens.
--- Comment #1 from Julien Isorce julien.isorce@gmail.com --- Created attachment 141084 --> https://bugs.freedesktop.org/attachment.cgi?id=141084&action=edit simple.c
Minimal test that reproduces the issue by drawing just 2 lines.
Build: gcc -Wall simple.c -o simple $(pkg-config --cflags --libs gl x11)
Run: for i in {0..300}; do (./simple &); done
Also happens with R600_DEBUG=nodma,nowc,nodcc
Julien Isorce julien.isorce@gmail.com changed:
What               |Removed |Added
----------------------------------------------------------------------------
Attachment #141084 |0       |1
is obsolete        |        |
--- Comment #2 from Julien Isorce julien.isorce@gmail.com --- Created attachment 141202 --> https://bugs.freedesktop.org/attachment.cgi?id=141202&action=edit simple.c
Minimized the repro test even more using just a pixmap (no window) and 1 glVertex (GL_POINTS).
--- Comment #3 from Julien Isorce julien.isorce@gmail.com --- Created attachment 141203 --> https://bugs.freedesktop.org/attachment.cgi?id=141203&action=edit cs_dump_user_space.txt
--- Comment #4 from Julien Isorce julien.isorce@gmail.com --- Created attachment 141204 --> https://bugs.freedesktop.org/attachment.cgi?id=141204&action=edit cs_dum_kernel_space.txt
Packet0 not allowed!
--- Comment #5 from Julien Isorce julien.isorce@gmail.com --- Extract of the 2 attached cs dumps:
User space, i.e. before the radeon_cs_ioctl ioctl: 0x00000290 0x00000000 0xC0016900 0x000002A1
Kernel space, i.e. inside radeon_cs_ioctl: 0x00000290 0x0000000b 0x00000000 0x000002a1
So for some reason 0x00000000C0016900 gets overwritten by 0x0000000b00000000.
Note that it always gets overwritten with the value above, and this value also appears in the other packet0 bug report: https://bugs.freedesktop.org/show_bug.cgi?id=84500#c7
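To see why the overwritten dwords lead to a packet0 complaint, it helps to decode them as PM4 packet headers. The sketch below is an illustration only: the function name `decode_pm4_header` is hypothetical, and the bit layout (type in bits 31:30, count in bits 29:16, type-3 opcode in bits 15:8, type-0 register index in the low 16 bits) is my understanding of the header macros used by the radeon kernel driver, so treat it as an assumption.

```python
def decode_pm4_header(header):
    """Split a 32-bit PM4 packet header into its fields (assumed layout)."""
    packet_type = (header >> 30) & 0x3
    fields = {"type": packet_type, "count": (header >> 16) & 0x3FFF}
    if packet_type == 3:
        # Type-3 packets carry an opcode in bits 15:8.
        fields["opcode"] = (header >> 8) & 0xFF
    elif packet_type == 0:
        # Type-0 packets carry a register dword index; shift to a byte offset.
        fields["reg"] = (header & 0xFFFF) << 2
    return fields


# Original header from user space, then the two values seen in kernel space.
for word in (0xC0016900, 0x0000000B, 0x00000000):
    decoded = decode_pm4_header(word)
    print(f"0x{word:08x}: " + ", ".join(f"{k}=0x{v:x}" for k, v in decoded.items()))
```

Under these assumptions the user-space dword 0xC0016900 is a type-3 packet with opcode 0x69, while both replacement dwords decode as type-0 packets, which would be consistent with the "Packet0 not allowed!" message.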
I have started to narrow down the issue, and it looks like it happens in "radeon_cs_parser_init" in kernel/drivers/gpu/drm/radeon, as the overwriting is already present just after this function returns. It is not easy to debug further because this function is quite difficult to understand, so any input would be appreciated, thx!
Does kernel space make a copy of the cs chunks, or does it just keep a pointer to them, as I see "user_ptr"?
Also note that the issue does not happen with amdgpu, so one possibility is that "amdgpu_cs_parser_init" is more robust.
--- Comment #6 from Christopher me@pc-networking-services.com --- Created attachment 141263 --> https://bugs.freedesktop.org/attachment.cgi?id=141263&action=edit dmsg output running on wayland
--- Comment #7 from Christopher me@pc-networking-services.com --- Hello,
I am getting similar issues with regard to fence wait timeouts. However, I have narrowed it down further: it ONLY happens when gnome is running on xorg.
Over the past month or so I have rebuilt my system from the ground up. I am NOT using a distro that holds people's hands with package managers and bloated useless kernel modules. I use the instructions from linuxfromscratch.org to build the entire system from the latest stable source code.
After I first boot into gnome running on xorg and log in, I click on activities in the gnome menu and select terminal. The little circle starts twirling, and after a few seconds the screen flashes and momentarily shows the grey login background, then flashes to what can only be described as a mini pixel dump. After a while it flashes back to the login screen and I need to log in again. At this point, if I click on the drop-down list of available login session types, gnome on xorg is missing from the list. If I log in at this stage, activating gnome terminal is successful; however, the dmesg log shows ring stalled errors and the dreaded parser error that has been mentioned here.
If I start gnome on wayland and then click on activities and on terminal to bring up gnome terminal, the terminal window opens almost immediately, even though the circle keeps twirling for a long time afterwards, and the output of dmesg is free of the ring timeouts.
Running xorg by itself using twm with clock and xterm also produces a clean dmesg log.
Please find the results attached for both boot tests. By the way, this is on one of the latest versions of the 4.18 kernel series available on kernel.org.
The version of Mesa used is: mesa-18.1.5
--- Comment #8 from Christopher me@pc-networking-services.com --- Created attachment 141264 --> https://bugs.freedesktop.org/attachment.cgi?id=141264&action=edit dmsg output running on xorg
--- Comment #9 from Michel Dänzer michel@daenzer.net --- (In reply to Christopher from comment #7)
> After I first boot into gnome, with it running on xorg, as soon as I have logged in and click on activities on the gnome menu and select terminal, then the little circle starts twirling, and after a few seconds the screen flashes, and it momentarily goes to the grey login background, then flashes to what can only be described as a mini pixel dump, then after a while it flashes back to the login screen again and you need to login again.
You're running into bug 105381, unrelated to this report, fixed in xf86-video-ati Git master.
--- Comment #10 from Christopher me@pc-networking-services.com --- (In reply to Michel Dänzer from comment #9)
> (In reply to Christopher from comment #7)
> > After I first boot into gnome, with it running on xorg, as soon as I have logged in and click on activities on the gnome menu and select terminal, then the little circle starts twirling, and after a few seconds the screen flashes, and it momentarily goes to the grey login background, then flashes to what can only be described as a mini pixel dump, then after a while it flashes back to the login screen again and you need to login again.
> You're running into bug 105381, unrelated to this report, fixed in xf86-video-ati Git master.
Hello Michel,
Thank you for taking the time to respond. After doing a git pull of xf86-video-ati, compiling, installing and re-booting, this has indeed solved the issue for me.
It really is next to impossible with the range of drivers that could have been the source of the error to know where to actually post a bug report. I am not a programmer, just an IT professional with decades of system administration experience, so once again, many thanks for pointing out how to solve my issue, even though I thought it looked similar to this.
Christopher.
--- Comment #11 from Julien Isorce julien.isorce@gmail.com --- I found time to go a bit further. Now I understand this radeon_cs_parser_init function a bit more.
If I comment out the AGP condition here https://cgit.freedesktop.org/~agd5f/linux/tree/drivers/gpu/drm/radeon/radeon... so that kdata is used, then I can verify that kdata contains the same data as user space.
But when writing to parser->ib.ptr here https://cgit.freedesktop.org/~agd5f/linux/tree/drivers/gpu/drm/radeon/radeon... and then comparing the data in parser->ib.ptr with kdata, I see the same difference as pointed out in comment #5.
Could it be an issue with pcie (though it works with amdgpu, well in fact it uses kdata on amdgpu)? Is there any way I can force a commit/flush just after it writes to parser->ib.ptr as a test, even if it is slower? thx!
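The kind of comparison described here (one buffer against another, dword by dword) can be sketched as follows. This is a hedged illustration in Python rather than the kernel-side check that was actually used; the helper names `parse_dump` and `diff_dwords` are hypothetical, and the sample data is the extract quoted in comment #5.

```python
def parse_dump(text):
    """Convert a whitespace-separated hex dump into a list of 32-bit dwords."""
    return [int(token, 16) for token in text.split()]


def diff_dwords(reference, candidate):
    """Return (index, reference, candidate) for every dword that differs."""
    if len(reference) != len(candidate):
        raise ValueError("dumps must have the same number of dwords")
    return [
        (index, ref, cand)
        for index, (ref, cand) in enumerate(zip(reference, candidate))
        if ref != cand
    ]


# Extract quoted in comment #5 of this report.
user_space = parse_dump("0x00000290 0x00000000 0xC0016900 0x000002A1")
kernel_space = parse_dump("0x00000290 0x0000000b 0x00000000 0x000002a1")

for index, ref, cand in diff_dwords(user_space, kernel_space):
    print(f"dword {index}: 0x{ref:08x} -> 0x{cand:08x}")
```

Running the same comparison on full dumps would show whether the corruption is limited to these two adjacent dwords or appears elsewhere in the command stream.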
--- Comment #12 from Christian König ckoenig.leichtzumerken@gmail.com --- (In reply to Julien Isorce from comment #11)
> Could it be an issue with pcie (though it works with amdgpu, well in fact it uses kdata on amdgpu)? Is there any way I can force a commit/flush just after it writes to parser->ib.ptr as a test, even if it is slower? thx!
Really unlikely. If we had a hardware problem with PCIe, we would see random bit flips and not a constant pattern like we do here.
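One rough way to express this reasoning in code is to count how many bits differ per dword and to check whether repeated captures show the same replacement. The sketch below is only an illustration: the helper names are hypothetical, the two dword pairs come from comment #5, and any list of repeated captures passed to `is_constant_pattern` is the caller's own data.

```python
def differing_bits(expected, actual):
    """Count the bits that differ between two 32-bit words."""
    return bin((expected ^ actual) & 0xFFFFFFFF).count("1")


def is_constant_pattern(observed_values):
    """True when every capture shows exactly the same replacement value."""
    return len(set(observed_values)) == 1


# (expected, actual) pairs for the two affected dwords from comment #5.
pairs = [(0x00000000, 0x0000000B), (0xC0016900, 0x00000000)]
flips = [differing_bits(expected, actual) for expected, actual in pairs]
print("bits changed per dword:", flips)
```

Several bits changing together in adjacent dwords, with the same replacement every time, looks more like a deterministic software overwrite than like isolated transmission errors.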
Pander pander@users.sourceforge.net changed:
What     |Removed |Added
----------------------------------------------------------------------------
See Also |        |https://bugs.freedesktop.org/show_bug.cgi?id=102909
Pander pander@users.sourceforge.net changed:
What |Removed |Added
----------------------------------------------------------------------------
OS   |All     |Linux (All)
Pander pander@users.sourceforge.net changed:
What     |Removed |Added
----------------------------------------------------------------------------
Priority |medium  |high
Hardware |Other   |All
Pander pander@users.sourceforge.net changed:
What     |Removed |Added
----------------------------------------------------------------------------
See Also |        |https://bugs.freedesktop.org/show_bug.cgi?id=104307
Pander pander@users.sourceforge.net changed:
What     |Removed |Added
----------------------------------------------------------------------------
See Also |        |https://bugs.freedesktop.org/show_bug.cgi?id=101712
Pander pander@users.sourceforge.net changed:
What     |Removed |Added
----------------------------------------------------------------------------
See Also |        |https://bugs.freedesktop.org/show_bug.cgi?id=105113
Jan Vesely jan.vesely@rutgers.edu changed:
What     |Removed                                             |Added
----------------------------------------------------------------------------
See Also |https://bugs.freedesktop.org/show_bug.cgi?id=105113 |
Martin Peres martin.peres@free.fr changed:
What       |Removed |Added
----------------------------------------------------------------------------
Resolution |---     |MOVED
Status     |NEW     |RESOLVED
--- Comment #13 from Martin Peres martin.peres@free.fr --- -- GitLab Migration Automatic Message --
This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.
You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/856.