https://bugs.freedesktop.org/show_bug.cgi?id=93649
Bug ID: 93649
Summary: [radeonsi] Graphics lockup while playing tf2
Product: Mesa
Version: 11.0
Hardware: x86-64 (AMD64)
OS: Linux (All)
Status: NEW
Severity: normal
Priority: medium
Component: Drivers/Gallium/radeonsi
Assignee: dri-devel@lists.freedesktop.org
Reporter: matthew@mjdsystems.ca
QA Contact: dri-devel@lists.freedesktop.org
Created attachment 120925 --> https://bugs.freedesktop.org/attachment.cgi?id=120925&action=edit Kernel dmesg around the time of the lockup.
After a period of time playing the latest version of TF2, my GPU locks up. After the kernel tries to reset it, X becomes stuck and won't work. The rest of the system is fine, however. Sometimes the GPU resets successfully and continues working, only to lock up again later, eventually freezing X.
Hardware:
GPU: Gigabyte Radeon HD 7970 Ghz edition OC
CPU: AMD Phenom ii X6 1100T
MB: Asus Crosshair IV Formula
Software:
Mesa: 11.1.0
DRM: 2.4.65
LLVM: 3.7.0
X: 1.17.4
DDX: 7.6.1
Kernel: 4.3.3
I have a dmesg with debug turned on and a strace of X from around the time it crashes (attached). I reduced the log files to the relevant bits, as they are quite large. I'll retry with the latest git to see if it helps.
--- Comment #1 from Matthew Dawson matthew@mjdsystems.ca --- Created attachment 120926 --> https://bugs.freedesktop.org/attachment.cgi?id=120926&action=edit Strace of Xorg up to X freezing
FD 20 is the drm device node, and it freezes on ioctl 0xc020645d.
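For readers who want to interpret that ioctl number, here is a minimal sketch of how the request value splits into its fields. It assumes the generic Linux `_IOC` layout used on x86 (8-bit number, 8-bit type, 14-bit size, 2-bit direction); the function name `decode_ioctl` is made up for illustration.

```python
def decode_ioctl(request):
    """Split a Linux ioctl request number into its _IOC fields (x86 layout assumed)."""
    directions = {0: "none", 1: "write", 2: "read", 3: "read/write"}
    return {
        "nr": request & 0xFF,                 # command number within the subsystem
        "type": chr((request >> 8) & 0xFF),   # "magic" character identifying the subsystem
        "size": (request >> 16) & 0x3FFF,     # size of the argument struct in bytes
        "dir": directions[(request >> 30) & 0x3],
    }


fields = decode_ioctl(0xC020645D)
print(fields)
```

The type character `d` is the DRM ioctl magic, and DRM reserves numbers from 0x40 upward for driver-specific commands, so 0x5d would be driver command 0x1d with a 32-byte read/write argument. Which radeon call that corresponds to should be checked against the radeon DRM header rather than taken from this sketch.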
--- Comment #2 from Matthew Dawson matthew@mjdsystems.ca --- Created attachment 120927 --> https://bugs.freedesktop.org/attachment.cgi?id=120927&action=edit Radeon blocked locks
Since X seemed to be blocked on an ioctl, I got a list of all the blocked locks. Most of the held locks belong to GUI-related programs that would be doing GL things, and they are all blocked on a lock, including one that is currently trying to reset my GPU.
I'm guessing there is a lock that is being grabbed twice, once when userspace makes an ioctl, and again during the reset. I'll keep digging.
Also, I think this may be a duplicate of #90217, as both involve Source games. I'll leave this open for now, in case tf2 has a different trigger.
--- Comment #3 from russianneuromancer@ya.ru --- There are other logs: https://github.com/ValveSoftware/Source-1-Games/issues/1943
--- Comment #4 from Matthew Dawson matthew@mjdsystems.ca --- Created attachment 121242 --> https://bugs.freedesktop.org/attachment.cgi?id=121242&action=edit This helps avoid a complete crash when a lockup occurs.
Note this doesn't solve this bug, it just helps manage it.
--- Comment #5 from pc.jago1337@gmail.com --- Can confirm, I have either the same or a similar problem on my R9 390 (using radeon, with DPM disabled). It doesn't just crash X though, it completely locks up and I have to reboot to even use TTY. Happens after 10-20 mins of TF2.
Running Arch Linux with everything up to date but no AUR packages, will post specifics later.
--- Comment #6 from Matthew Dawson matthew@mjdsystems.ca --- Created attachment 121293 --> https://bugs.freedesktop.org/attachment.cgi?id=121293&action=edit Second patch to fix system lockup after gpu reset
This has already been accepted from the mailing list; I'm including it here for completeness.
If anyone is experiencing this issue, can you please try with all of these patches applied? For now, X should die and restart without acceleration, but getting a dmesg out or restarting should be fine.
--- Comment #7 from pc.jago1337@gmail.com ---
CPU: FX 8350
GPU: R9 390
MB: Asrock 970 Extreme4
Software:
Kernel: 4.3.3-3-ARCH x86_64
Mesa: 11.1.1
DRM: 2.43.0
LLVM: 3.7.0
X: 1.18.0
As mentioned above, I get the crash with TF2, but *NOT* CS:GO.
--- Comment #8 from pc.jago1337@gmail.com --- Also, this could be a duplicate of bug #92912 - random lockups in TF2, all with radeon.
--- Comment #9 from Matthew Dawson matthew@mjdsystems.ca --- (In reply to pc.jago1337 from comment #8)
> Also, this could be a duplicate of bug #92912 - random lockups in TF2, all with radeon.
I was asked to file this bug separately. Also, that bug covers R600, a different GPU architecture from GCN.
--- Comment #10 from Rosco P. Coltrane roscofdporg@manashort.com --- Same problem here on Fedora 23.
GPU: HD 7970
CPU: Intel Core i7 950
Mesa 11.1.0
DRM 2.43.0
LLVM 3.7.0
kernel: 4.3.4
The logs are filled with "ring stalled" and GPU lockup messages. I can send more logs if needed.
radeon 0000:02:00.0: ring 3 stalled for more than 10249msec
radeon 0000:02:00.0: GPU lockup (current fence id 0x000000000001e5f1 last fence id 0x000000000001e5f2 on ring 3)
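When comparing many such reports, it can help to pull the ring and fence IDs out of the lockup message mechanically. The following is a rough sketch, not part of any driver tooling; it assumes that "current" is the last fence the GPU completed and "last" is the last fence submitted, so their difference is the number of fences still pending. The names `LOCKUP_RE` and `parse_lockup` are hypothetical.

```python
import re

# Matches the radeon "GPU lockup" message format quoted in this report.
LOCKUP_RE = re.compile(
    r"GPU lockup \(current fence id (0x[0-9a-f]+) "
    r"last fence id (0x[0-9a-f]+) on ring (\d+)\)"
)


def parse_lockup(line):
    """Return ring and fence information from a lockup line, or None if it does not match."""
    match = LOCKUP_RE.search(line)
    if match is None:
        return None
    current = int(match.group(1), 16)
    last = int(match.group(2), 16)
    return {
        "ring": int(match.group(3)),
        "current": current,
        "last": last,
        "pending": last - current,  # assumption: fences submitted but not yet completed
    }


sample = (
    "radeon 0000:02:00.0: GPU lockup (current fence id 0x000000000001e5f1 "
    "last fence id 0x000000000001e5f2 on ring 3)"
)
print(parse_lockup(sample))
```

For the line above this reports ring 3 with a single pending fence, which under the stated assumption would mean the GPU stalled on the most recently submitted work.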
I've tried a different firmware (http://people.freedesktop.org/~agd5f/radeon_ucode/k/) which seems to have helped other people with their own problems, but it didn't help in my case.
Does it make sense to try rolling back to an older kernel?
Matthew Dawson matthew@mjdsystems.ca changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #121242|0                           |1
        is obsolete|                            |
--- Comment #11 from Matthew Dawson matthew@mjdsystems.ca --- Created attachment 121578 --> https://bugs.freedesktop.org/attachment.cgi?id=121578&action=edit New avoid lockup patch
Latest version as posted to dri-devel. With these two patches, your system should no longer lock up forever. It will freeze the game for a moment, and X may die for other reasons.
Now the underlying tf2 issue needs investigation.
--- Comment #12 from Luca Osvaldo lukycrociato@gmail.com --- I can confirm that this also affects me. I'm using the AMDGPU driver with powerplay enabled on a custom Linux 4.5 kernel, with an AMD R9 380 video card.
Matthew Dawson matthew@mjdsystems.ca changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |matias.locatti@gmail.com
--- Comment #13 from Matthew Dawson matthew@mjdsystems.ca --- *** Bug 95308 has been marked as a duplicate of this bug. ***
AmarildoJr amarildosjr@riseup.net changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO
--- Comment #14 from AmarildoJr amarildosjr@riseup.net --- Any chance VALVe introduced this? They won't admit it. https://github.com/ValveSoftware/steam-for-linux/issues/4409
The patches attached here are present in Linux 4.6. I tested linux-git-4.7-rc7 with mesa-git-12.1 compiled against llvm-svn-3.9, and TF2 still crashes.
Setting every graphical option to Low doesn't help.
--- Comment #15 from Nicolai Hähnle nhaehnle@gmail.com --- This is certainly a bug in our driver (unlike what was written on the Github tracker, a game *can* cause a hang e.g. by writing an infinite loop in a shader, but that seems exceedingly unlikely in the case of TF2). The problem with this particular bug is that it seems non-deterministic (i.e. not reliably reproducible), and that makes it hard to debug.
--- Comment #16 from AmarildoJr amarildosjr@riseup.net --- So there's a chance it won't be fixed at all?
I was thinking about bisecting from version 3.16 (where I know it worked for me, on Debian Jessie) until ~4.1, but I don't have that kind of time right now.
--- Comment #17 from Nicolai Hähnle nhaehnle@gmail.com --- Actually, if you could find a clear bisection result, that would be tremendously helpful and would probably lead to a fix.
However, with this kind of bug you need to be extremely sure about what you're doing when bisecting. For example, if you know that the hang typically occurs after 10 minutes, then you should play for at least one hour (perhaps even longer) with each kernel. Otherwise, you might have just gotten lucky, and the bisect result would be worse than useless.
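To put a rough number on "gotten lucky", here is a small worked sketch. It assumes, purely for illustration, that the time until a hang on a bad kernel is exponentially distributed with the given mean; nothing in this report establishes that model, and the function name is hypothetical.

```python
import math


def chance_of_false_good(test_minutes, mean_minutes_to_hang):
    """Probability that a bad kernel survives the whole test without hanging.

    Assumes an exponential time-to-hang model, which is an illustrative
    assumption and not something measured in this bug report.
    """
    return math.exp(-test_minutes / mean_minutes_to_hang)


for minutes in (10, 30, 60):
    print(minutes, round(chance_of_false_good(minutes, 10), 4))
```

Under that assumption, with a 10-minute mean, a 10-minute test would wrongly pass a bad kernel about 37% of the time, while a one-hour test would do so about 0.25% of the time. Since one wrong verdict is enough to derail a bisect, the longer test time per step pays off.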
--- Comment #18 from AmarildoJr amarildosjr@riseup.net --- Yes, I would definitely test it for a long period, something like 16 hours hehehe.
However, I can't do any bisecting right now; I'm tremendously busy at the moment. Too bad there aren't many Linux players with this problem, otherwise someone would have figured this out already.
Cheers.
--- Comment #19 from pandiculationfinch@gmail.com --- happens with stellaris as well.
--- Comment #20 from Marek Olšák maraeo@gmail.com --- Does this fix it?
https://cgit.freedesktop.org/mesa/mesa/commit/?id=947e0614d091c260651e4f3d62...
In other words, does mesa/master work?
Matthew Dawson matthew@mjdsystems.ca changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |REOPENED
--- Comment #21 from Matthew Dawson matthew@mjdsystems.ca --- I can confirm the latest git head (50b49d242d702e4728329cc59f87d929963e7c53) still causes lockups, though they seem to come much faster.
Also seems to have a regression regarding lighting, I'll see about bisecting that in a separate report.
LLVM: 3.8.0
DRM: 2.43.0
Linux: 4.6.3-gentoo
--- Comment #22 from pandiculationfinch@gmail.com --- I'll test this weekend with stellaris and let you know.
--- Comment #23 from pandiculationfinch@gmail.com --- sad to say it did not fix the issue for me. it ran longer than usual though prior to the crash. I suspect you nixed one issue but multiple are going on.
I'm happy to run any debugging/patches you wish to try.
--- Comment #24 from AmarildoJr amarildosjr@riseup.net --- Didn't fix for me either, on Arch Linux.
AmarildoJr amarildosjr@riseup.net changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amarildosjr@riseup.net
--- Comment #25 from AmarildoJr amarildosjr@riseup.net --- Marek, since you work for AMD, I wonder if you could get a few hints for the fix on Catalyst's sources?
--- Comment #26 from Marek Olšák maraeo@gmail.com --- (In reply to AmarildoJr from comment #25)
> Marek, since you work for AMD, I wonder if you could get a few hints for the fix on Catalyst's sources?
It's not so simple. This is a bug somewhere in the Mesa driver such that looking at other drivers won't likely help.
--- Comment #27 from AmarildoJr amarildosjr@riseup.net --- (In reply to Marek Olšák from comment #26)
> (In reply to AmarildoJr from comment #25)
> > Marek, since you work for AMD, I wonder if you could get a few hints for the fix on Catalyst's sources?
> It's not so simple. This is a bug somewhere in the Mesa driver such that looking at other drivers won't likely help.
This is a very weird issue. I think it may not be in Mesa, and here's why:
* On Debian Jessie with kernel 3.16 and Mesa 10.3, the problem doesn't happen;
* On the same Debian, but with mesa backported, the problem also doesn't happen;
* On the same Debian with Mesa backported and the Kernel backported, the problem still doesn't happen;
* On Arch Linux with Mesa downgraded to 10.3, the problem happens;
* On the same Arch Linux with Mesa and Kernel downgraded (Kernel to version 3.16 and even 3.10), the problem still happens;
* I'm not 100% sure I downgraded the Firmware on Arch, but I'll try today since I'm testing a few drivers in Linux;
* On vanilla Arch with Catalyst/FGLRX, the problem doesn't happen;
So I do think this issue is much bigger than everybody thinks and only happens with a certain combination of Mesa, Kernel, Firmware, and possibly libdrm, llvm, and other pieces of software as well.
What I really think is that VALVe should investigate this since this problem started happening after they introduced mandatory Texture Streaming.
--- Comment #28 from Vedran Miletić vedran@miletic.net --- (In reply to AmarildoJr from comment #27)
> (In reply to Marek Olšák from comment #26)
> > (In reply to AmarildoJr from comment #25)
> > > Marek, since you work for AMD, I wonder if you could get a few hints for the fix on Catalyst's sources?
> > It's not so simple. This is a bug somewhere in the Mesa driver such that looking at other drivers won't likely help.
> This is a very weird issue. I think it may not be in Mesa, and here's why:
> - On Debian Jessie with kernel 3.16 and Mesa 10.3, the problem doesn't happen;
> - On the same Debian, but with mesa backported, the problem also doesn't happen;
> - On the same Debian with Mesa backported and the Kernel backported, the problem still doesn't happen;
> - On Arch Linux with Mesa downgraded to 10.3, the problem happens;
> - On the same Arch Linux with Mesa and Kernel downgraded (Kernel to version 3.16 and even 3.10), the problem still happens;
> - I'm not 100% sure I downgraded the Firmware on Arch, but I'll try today since I'm testing a few drivers in Linux;
> - On vanilla Arch with Catalyst/FGLRX, the problem doesn't happen;
> So I do think this issue is much bigger than everybody thinks and only happens with a certain combination of Mesa, Kernel, Firmware, and possibly libdrm, llvm, and other pieces of software as well.
> What I really think is that VALVe should investigate this since this problem started happening after they introduced mandatory Texture Streaming.
Is the elephant in the room in this case the LLVM version difference between the two setups?
--- Comment #29 from AmarildoJr amarildosjr@riseup.net --- I just tested the oldest firmware available in the Arch Linux Archive, namely linux-firmware 20130725-1, and the crashes don't happen. This is with current Arch, not a single package is old and all packages are up-to-date according to the repos.
I'm hitting 10 to 30 FPS in-game, but at least the crashes don't happen which IMO is a very good sign of where the problem might be.
I'll report the firmware problem to AMD.
In the meantime, does anyone know how I can try running the firmware from Catalyst?
@Marek, where is the best place to report this?
--- Comment #30 from AmarildoJr amarildosjr@riseup.net --- (In reply to Vedran Miletić from comment #28)
> Is the elephant in the room in this case the LLVM version difference between the two setups?
According to a Gentoo user who compiled llvm 3.5 and an older version of mesa against it, the problem still occurs.
--- Comment #31 from Marek Olšák maraeo@gmail.com --- (In reply to AmarildoJr from comment #29)
> I just tested the oldest firmware available in the Arch Linux Archive, namely linux-firmware 20130725-1, and the crashes don't happen. This is with current Arch, not a single package is old and all packages are up-to-date according to the repos.
> I'm hitting 10 to 30 FPS in-game, but at least the crashes don't happen which IMO is a very good sign of where the problem might be.
> I'll report the firmware problem to AMD.
> In the mean time, does anyone know how I can try running the firmware from Catalyst?
> @Marek, where is the best place to report this?
So are we certain the hangs are caused by firmware? Bisecting the firmware would help a lot.
What's your GPU?
--- Comment #32 from Rosco P. Coltrane roscofdporg@manashort.com --- Today I tested 3 different firmware versions on Manjaro (HD 7970):
linux-firmware-20150527.3161bfa-1-any.pkg.tar.xz (chosen because it was a bit before the first bugs were reported with TF2)
This allowed me to play TF2 without bugs for ~30 min. Then I hit the bug (screen freeze, sound loop), but the system recovered fine after 20 sec with no loss of performance. Both before and after the bug I still had a problem with the mouse pointer, which wasn't visible at all times.
linux-firmware-20131013.7d0c7a8-1-any.pkg.tar.xz
This allowed me to play for a good hour, then: bug + recovery after 20 sec. At the fifth bug the screen simply hung, and TF2 and Steam crashed (I had to use ctrl+alt+f2). This one didn't have the mouse bug. This is the most stable TF2 experience I can get.
linux-firmware-20130725-1-any.pkg.tar.xz (earliest firmware available in the repo)
This one crashed after 2 seconds loading the first map.
The first two firmwares also seem to have fixed the same bug which was present in "Victor Vran" (same symptoms, screen freeze + sound loop).
--- Comment #33 from pandiculationfinch@gmail.com --- Not certain, but assuming I ran the test correctly, I experienced a crash using the oldest linux firmware I had, linux-firmware-20140828. That leaves 13 months of time to bisect if linux-firmware 20130725-1 does indeed work. I'll see about installing the 20130725 version later; I have other stuff I need to do.
commands run to downgrade to linux-firmware-20140828:
sudo pacman -U /var/cache/pacman/pkg/linux-firmware-20140828.13eb208-1-any.pkg.tar.xz
sudo pacman -S linux
after downgrade I had the following error on boot, so I'm assuming it worked:
Sep 04 09:53:14 jambli kernel: radeon 0000:01:00.0: Direct firmware load for radeon/TAHITI_vce.bin failed with error -2
Sep 04 09:53:14 jambli kernel: radeon 0000:01:00.0: radeon_vce: Can't load firmware "radeon/TAHITI_vce.bin"
Sep 04 09:53:14 jambli kernel: radeon 0000:01:00.0: failed VCE (-2) init.
other info:
Name : llvm-libs
Version : 3.8.1-1
Name : linux
Version : 4.7.2-1
Name : mesa-git
Version : 84594.98f734e-1
Extended renderer info (GLX_MESA_query_renderer):
    Vendor: X.Org (0x1002)
    Device: AMD OLAND (DRM 2.45.0 / 4.7.2-1-ARCH, LLVM 4.0.0) (0x6610)
    Version: 12.1.0
    Accelerated: yes
    Video memory: 2048MB
    Unified memory: no
    Preferred profile: core (0x1)
    Max core profile version: 4.3
    Max compat profile version: 3.0
    Max GLES1 profile version: 1.1
    Max GLES[23] profile version: 3.1
I forget the exact card off the top of my head but here is the output of lspci, if you need more precise card information let me know how to get it from the cli =):
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Oland XT [Radeon HD 8670 / R7 250/350]
01:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Cape Verde/Pitcairn HDMI Audio [Radeon HD 7700/7800 Series]
--- Comment #34 from pandiculationfinch@gmail.com --- I should note I was testing against stellaris.
--- Comment #35 from pandiculationfinch@gmail.com --- The game froze again after ~20 minutes using the 20130725 firmware. So if downgrading to 20130725 fixes TF2, this likely isn't the same issue as the TF2 one. Game: stellaris
commands run to downgrade to linux-firmware-20130725:
sudo pacman -U /var/cache/pacman/pkg/linux-firmware-20130725-1-any.pkg.tar.xz
sudo pacman -S linux
other info:
Name : llvm-libs
Version : 3.8.1-1
Name : linux
Version : 4.7.2-1
Name : mesa-git
Version : 84594.98f734e-1
Name : linux-firmware
Version : 20130725-1
lspci:
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Oland XT [Radeon HD 8670 / R7 250/350]
01:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Cape Verde/Pitcairn HDMI Audio [Radeon HD 7700/7800 Series]
boot logs:
Sep 04 11:12:28 jambli kernel: [drm] initializing kernel modesetting (OLAND 0x1002:0x6610 0x174B:0xE269 0x00).
Sep 04 11:12:28 jambli kernel: [drm] register mmio base: 0xFDD80000
Sep 04 11:12:28 jambli kernel: [drm] register mmio size: 262144
Sep 04 11:12:28 jambli kernel: ATOM BIOS: C66201
Sep 04 11:12:28 jambli kernel: radeon 0000:01:00.0: VRAM: 2048M 0x0000000000000000 - 0x000000007FFFFFFF (2048M used)
Sep 04 11:12:28 jambli kernel: radeon 0000:01:00.0: GTT: 2048M 0x0000000080000000 - 0x00000000FFFFFFFF
Sep 04 11:12:28 jambli kernel: [drm] Detected VRAM RAM=2048M, BAR=256M
Sep 04 11:12:28 jambli kernel: [drm] RAM width 128bits DDR
Sep 04 11:12:28 jambli kernel: [TTM] Zone kernel: Available graphics memory: 8209378 kiB
Sep 04 11:12:28 jambli kernel: [TTM] Zone dma32: Available graphics memory: 2097152 kiB
Sep 04 11:12:28 jambli kernel: [TTM] Initializing pool allocator
Sep 04 11:12:28 jambli kernel: [TTM] Initializing DMA pool allocator
Sep 04 11:12:28 jambli kernel: [drm] radeon: 2048M of VRAM memory ready
Sep 04 11:12:28 jambli kernel: [drm] radeon: 2048M of GTT memory ready.
Sep 04 11:12:28 jambli kernel: [drm] Loading oland Microcode
Sep 04 11:12:28 jambli kernel: radeon 0000:01:00.0: Direct firmware load for radeon/oland_pfp.bin failed with error -2
Sep 04 11:12:28 jambli systemd[1]: Created slice system-lvm2\x2dpvscan.slice.
Sep 04 11:12:28 jambli kernel: radeon 0000:01:00.0: Direct firmware load for radeon/oland_me.bin failed with error -2
Sep 04 11:12:28 jambli kernel: radeon 0000:01:00.0: Direct firmware load for radeon/oland_ce.bin failed with error -2
Sep 04 11:12:28 jambli kernel: radeon 0000:01:00.0: Direct firmware load for radeon/oland_rlc.bin failed with error -2
Sep 04 11:12:28 jambli kernel: radeon 0000:01:00.0: Direct firmware load for radeon/oland_mc.bin failed with error -2
Sep 04 11:12:28 jambli kernel: radeon 0000:01:00.0: Direct firmware load for radeon/OLAND_mc2.bin failed with error -2
Sep 04 11:12:28 jambli kernel: [drm] radeon/OLAND_mc.bin: 31452 bytes
Sep 04 11:12:28 jambli kernel: radeon 0000:01:00.0: Direct firmware load for radeon/oland_smc.bin failed with error -2
Sep 04 11:12:28 jambli kernel: radeon 0000:01:00.0: Direct firmware load for radeon/OLAND_smc.bin failed with error -2
Sep 04 11:12:28 jambli kernel: smc: error loading firmware "radeon/OLAND_smc.bin"
Sep 04 11:12:28 jambli kernel: [drm] Internal thermal controller with fan control
Sep 04 11:12:28 jambli kernel: [drm] radeon: power management initialized
Sep 04 11:12:28 jambli kernel: radeon 0000:01:00.0: Direct firmware load for radeon/TAHITI_vce.bin failed with error -2
Sep 04 11:12:28 jambli kernel: radeon 0000:01:00.0: radeon_vce: Can't load firmware "radeon/TAHITI_vce.bin"
Sep 04 11:12:28 jambli kernel: radeon 0000:01:00.0: failed VCE (-2) init.
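When checking which firmware files a downgraded package actually provides, it may help to list the failed loads from a boot log. The sketch below is only an illustration with hypothetical names (`FW_FAIL_RE`, `failed_firmware`); it matches the "Direct firmware load ... failed with error" lines shown above and translates the error number with Python's `errno` table. Error -2 is `ENOENT`, meaning the file was not found.

```python
import errno
import re

# Matches the kernel's "Direct firmware load for <file> failed with error <n>" message.
FW_FAIL_RE = re.compile(r"Direct firmware load for (\S+) failed with error (-?\d+)")


def failed_firmware(log_text):
    """Return (firmware file, errno name) pairs for each failed load in the log text."""
    results = []
    for path, code in FW_FAIL_RE.findall(log_text):
        name = errno.errorcode.get(abs(int(code)), "unknown")
        results.append((path, name))
    return results


sample = """\
radeon 0000:01:00.0: Direct firmware load for radeon/oland_pfp.bin failed with error -2
[drm] radeon/OLAND_mc.bin: 31452 bytes
radeon 0000:01:00.0: Direct firmware load for radeon/TAHITI_vce.bin failed with error -2
"""
print(failed_firmware(sample))
```

Lines that report a successful load, such as the `OLAND_mc.bin` size line, are ignored, so the output lists only the files the kernel could not find.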
--- Comment #36 from Marek Olšák maraeo@gmail.com --- If you're testing Mesa git, would you please set GALLIUM_DDEBUG="pipelined 2000" and run TF2, wait until the GPU hangs and repeat. After it happens for the 3rd time, please zip and attach the contents of ~/ddebug_dumps/*. There should be 3 files.
Though I've got a hunch that we're just running around in circles.
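For anyone following those steps, a small helper along these lines could set the variable for a launched process and bundle the resulting dumps. This is only a sketch: the function names and the archive name are hypothetical, and the only details taken from the comment above are the `GALLIUM_DDEBUG` value and the `~/ddebug_dumps` directory.

```python
import os
import zipfile


def ddebug_env(base_env):
    """Return a copy of base_env with GALLIUM_DDEBUG set as requested in the comment."""
    env = dict(base_env)
    env["GALLIUM_DDEBUG"] = "pipelined 2000"
    return env


def zip_dumps(dump_dir, archive_path):
    """Zip every regular file in dump_dir and return the names that were added."""
    added = []
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for name in sorted(os.listdir(dump_dir)):
            path = os.path.join(dump_dir, name)
            if os.path.isfile(path):
                archive.write(path, arcname=name)
                added.append(name)
    return added


if __name__ == "__main__":
    dump_dir = os.path.expanduser("~/ddebug_dumps")
    if os.path.isdir(dump_dir):
        # Hypothetical archive name; attach the resulting file to the bug.
        print(zip_dumps(dump_dir, "ddebug_dumps.zip"))
```

The dictionary from `ddebug_env(os.environ)` can be passed as the `env` argument of `subprocess.run` when starting the game outside Steam; within Steam, the launch option form shown in a later comment (`GALLIUM_DDEBUG="pipelined 2000" %command%`) does the same job.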
--- Comment #37 from pandiculationfinch@gmail.com --- Created attachment 126454 --> https://bugs.freedesktop.org/attachment.cgi?id=126454&action=edit stellaris run via steam: GALLIUM_DDEBUG="pipelined 2000" %command%
here are the dumps generated.
it seems to be hit or miss whether anything was actually written into the files. the computer completely locks up when it encounters the freeze in stellaris.
stellaris was even more unstable with the GALLIUM_DDEBUG, often failing to even start up.
--- Comment #38 from AmarildoJr amarildosjr@riseup.net --- Does anyone have a little bit of free time to extract the files from "lib32-catalyst-libgl" into a system running "lib32-mesa-libgl" and see if that helps?
--- Comment #39 from hofmann.zachary@gmail.com --- I'm also having this problem with Radeon R7 250 (radeonsi), Mesa 12.0.2, LLVM 3.8.1 and kernel version 4.6.0.
--- Comment #40 from AmarildoJr amarildosjr@riseup.net --- If disabling DPM fixed the issue, shouldn't developers study its code a little bit? I'm 99.99% positive the issue is in there somewhere, even for AMDGPU (since RadeonSI and AMDGPU drivers share a lot of code).
--- Comment #41 from hofmann.zachary@gmail.com --- (In reply to Amarildo from comment #40)
> If disabling DPM fixed the issue, shouldn't developers study it's code a little bit? I'm 99.99% positive the issue is in there somewhere, even for AMDGPU (since RadeonSI and AMDGPU drivers share a lot of code).
Another user previously stated in the thread that they were experiencing the issues and had DPM disabled.
@Marek Olšák Please let me know if there's anything I can do to help hunt this bug down.
--- Comment #42 from Amarildo amarildo-geral@autistici.org --- (In reply to hofmann.zachary from comment #41)
> (In reply to Amarildo from comment #40)
> > If disabling DPM fixed the issue, shouldn't developers study it's code a little bit? I'm 99.99% positive the issue is in there somewhere, even for AMDGPU (since RadeonSI and AMDGPU drivers share a lot of code).
> Another user previously stated in the thread that they were experiencing the issues and had DPM disabled.
> @Marek Olšák Please let me know if there's anything I can do to help hunt this bug down.
But that's one user's word against at least 5. Do we even know if the user actually disabled DPM or has the capacity to do so? Because I'm sure me and others (like Gentoo users) did in fact disable DPM and the hang didn't happen. So I don't think our word is less valid just because *one* user claimed he/she disabled DPM and the hang still happened.
--- Comment #43 from Amarildo amarildo-geral@autistici.org --- Just tried Mesa-Git (13.1) with the AMDGPU driver on R9 270X. The crash happens here as well.
However, looking at journalctl I can see new errors from the AMDGPU driver, and some brief research suggests it could be a TF2 texturing problem.
The error: GPU fault detected: 147 0x000ac802
Similar bugs have been resolved already:
https://bugs.freedesktop.org/show_bug.cgi?id=87278
https://bugs.freedesktop.org/show_bug.cgi?id=84614
LLVM seems to be related too.
--- Comment #44 from Rosco P. Coltrane roscofdporg@manashort.com --- I don't know if it can be of any help, but I've been playing "7 days to die" during the last weeks, regularly for the last days, and I didn't encounter any kind of bug.
That was until yesterday evening, when to my great surprise I hit the same bug (freeze, sound loop). It totally crashed my machine once and merely froze it (with a recovery after a few seconds) twice.
I checked that no update had occurred to the game files, the Steam runtime or my OS between the days when it worked flawlessly and yesterday, when it crashed 3 times in 15 minutes.
So if it's not only related to files, could it be related to the hardware? Could it be a faulty card (HD7970), or maybe a mix between a faulty hardware and some software instruction?
--- Comment #45 from Amarildo amarildo-geral@autistici.org --- Faulty hardware doesn't make any sense, because:
- It only happens on Linux;
- It only happens with specific combinations of Mesa/LLVM/Kernel/Firmware/etc
- It doesn't happen with the proprietary drivers
--- Comment #46 from hofmann.zachary@gmail.com --- (In reply to Amarildo from comment #45)
> Faulty hardware doesn't make any sense, because:
> - It only happens on Linux;
> - It only happens with specific combinations of Mesa/LLVM/Kernel/Firmware/etc
> - It doesn't happen with the proprietary drivers
It's probably not the exact same crash, but FWIW I also get crashes with the proprietary driver and TF2 when I tested it last. I just don't want people to get their hopes up only to have them let down.
--- Comment #47 from Amarildo amarildo-geral@autistici.org --- In all honesty, this is one of the most interesting bugs I know. Among all the people who have it, there are variations in what causes it in the first place.
What works for me (Debian Jessie with Mesa/libc6 from Backports, for example) might still cause the crash for some people.
What I do know is that it's not caused by faulty hardware. It could be for some, but I seriously doubt it's the cause for 99.99% of people experiencing the issue.
--- Comment #48 from Marek Olšák maraeo@gmail.com --- Does this fix the hangs? https://cgit.freedesktop.org/mesa/mesa/commit/?id=d4d9ec55c589156df4edc227a8...
It changes the HTILE (HyperZ) allocation function to r600_aligned_buffer_create. Without that, the hardware can hang on big GPUs (Tahiti/Pitcairn/Hawaii/Tonga/etc), but not APUs or small GPUs. The hang happens when TTM decides to move HTILE to a different location with an unaligned physical address (which is pretty random). The hardware tries to access the unaligned address and boom.
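The alignment condition described above can be illustrated with a few lines of arithmetic. This is a sketch only: the 32 KiB alignment and the sample address are hypothetical values picked for the example, not the real HTILE requirement, and the helper names are not taken from Mesa.

```python
# Hypothetical alignment for illustration; the real HTILE requirement depends on the GPU.
HYPOTHETICAL_ALIGNMENT = 32 * 1024


def is_aligned(address, alignment):
    """True if address is a multiple of alignment."""
    return address % alignment == 0


def align_up(address, alignment):
    """Round address up to the next multiple of alignment."""
    return (address + alignment - 1) // alignment * alignment


# A made-up physical address that a buffer move could land on.
moved_to = 0x12345000
print(is_aligned(moved_to, HYPOTHETICAL_ALIGNMENT))
print(hex(align_up(moved_to, HYPOTHETICAL_ALIGNMENT)))
```

A buffer created with only page alignment can be moved to an address like `moved_to`, which fails the check; requesting the larger alignment at allocation time is what keeps every later placement valid.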
--- Comment #49 from Marek Olšák maraeo@gmail.com --- (In reply to Marek Olšák from comment #48)
> Does this fix the hangs? https://cgit.freedesktop.org/mesa/mesa/commit/?id=d4d9ec55c589156df4edc227a86b4a8c41048d58
> It changes the HTILE (HyperZ) allocation function to r600_aligned_buffer_create. Without that, the hardware can hang on big GPUs (Tahiti/Pitcairn/Hawaii/Tonga/etc), but not APUs or small GPUs. The hang happens when TTM decides to move HTILE to a different location with an unaligned physical address (which is pretty random). The hardware tries to access the unaligned address and boom.
Actually, I think that commit only affects Hawaii and Fiji. Other GPUs might be unaffected, which means the Tahiti hangs are due to a different bug.
--- Comment #50 from Matthew Dawson matthew@mjdsystems.ca --- (In reply to Marek Olšák from comment #49)
> (In reply to Marek Olšák from comment #48)
> > Does this fix the hangs? https://cgit.freedesktop.org/mesa/mesa/commit/?id=d4d9ec55c589156df4edc227a86b4a8c41048d58
> > It changes the HTILE (HyperZ) allocation function to r600_aligned_buffer_create. Without that, the hardware can hang on big GPUs (Tahiti/Pitcairn/Hawaii/Tonga/etc), but not APUs or small GPUs. The hang happens when TTM decides to move HTILE to a different location with an unaligned physical address (which is pretty random). The hardware tries to access the unaligned address and boom.
> Actually, I think that commit only affects Hawaii and Fiji. Other GPUs might be unaffected, which means the Tahiti hangs are due to a different bug.
I've previously tried disabling hyperz on Tahiti with no luck in side stepping this bug, so I don't think this is the issue.
Could there be other buffers that need similar treatment that are being ignored? Is there an easy way to test this locally?
--- Comment #51 from Marek Olšák maraeo@gmail.com --- You can try this:
diff --git a/src/gallium/winsys/radeon/drm/radeon_drm_bo.c b/src/gallium/winsys/radeon/drm/radeon_drm_bo.c
index a15d559..ab95bae 100644
--- a/src/gallium/winsys/radeon/drm/radeon_drm_bo.c
+++ b/src/gallium/winsys/radeon/drm/radeon_drm_bo.c
@@ -939,7 +939,7 @@ radeon_winsys_bo_create(struct radeon_winsys *rws,
     struct radeon_drm_winsys *ws = radeon_drm_winsys(rws);
     struct radeon_bo *bo;
     unsigned usage = 0, pb_cache_bucket;
-
+alignment *= 2;
     /* Only 32-bit sizes are supported. */
     if (size > UINT_MAX)
         return NULL;
It will only affect radeon, not amdgpu.
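One way to read this experiment: doubling the requested alignment never breaks a buffer's original requirement, because every offset that is a multiple of 2*A is also a multiple of A. It only halves the number of places a buffer may land. The C sketch below checks that property by brute force. The function names are hypothetical, and this reading of the patch's intent is an assumption, not something stated in the thread.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only; not part of Mesa. */

/* Counts the multiples of alignment in the range [0, limit). */
static inline uint64_t count_aligned_offsets(uint64_t limit, uint64_t alignment)
{
    assert(alignment != 0);
    uint64_t count = 0;
    for (uint64_t off = 0; off < limit; off += alignment)
        count++;
    return count;
}

/* Returns 1 if every offset below limit that is aligned to 2*alignment
 * is also aligned to alignment, 0 otherwise. */
static inline int doubled_implies_original(uint64_t limit, uint64_t alignment)
{
    assert(alignment != 0);
    for (uint64_t off = 0; off < limit; off += 2 * alignment) {
        if (off % alignment != 0)
            return 0;
    }
    return 1;
}
```

In a 64 KiB range, for instance, there are 16 offsets aligned to 4096 but only 8 aligned to 8192. If hangs stopped with the patch applied, that would hint that some buffer needs a stricter alignment than it currently requests.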
https://bugs.freedesktop.org/show_bug.cgi?id=93649
--- Comment #52 from hofmann.zachary@gmail.com --- Unless the changed code works independently of the nohyperz option I don't think it will help, since disabling hyperz on verde doesn't help either.
https://bugs.freedesktop.org/show_bug.cgi?id=93649
--- Comment #53 from smoki smoki00790@gmail.com ---
It might be possible that the game fixed something, as I see there was a game update 3 days ago with the following mentioned in the changelog:
"Improved several aspects of texture handling for OS X and Linux clients
This should reduce the rate of "Out of memory" errors for players on high texture settings, especially on level change
Players still encountering this error can reduce texture quality to medium or lower to greatly improve stability pending further improvements"
http://store.steampowered.com/news/25022/
Just a wild guess that this might change something, since the game started to be unstable on radeonsi when texture streaming and the memory reduction were introduced last year.
https://bugs.freedesktop.org/show_bug.cgi?id=93649
--- Comment #54 from Amarildo amarildo-geral@autistici.org --- I remember disabling stream textures and still having the issue, as well as setting all graphic settings to minimal.
Can anyone confirm the status of this bug on Pitcairn + Mesa-git + amdgpu kernel driver?
https://bugs.freedesktop.org/show_bug.cgi?id=93649
--- Comment #55 from Amarildo amarildo-geral@autistici.org --- Seems that hang handling wasn't implemented at all for some GPUs: https://cgit.freedesktop.org/~agd5f/linux/commit/drivers/gpu/drm/amd?h=amd-s...
I haven't yet tried playing TF2 with amd-staging-4.7 (though I have been using it for a few days). I'll try it this morning.
https://bugs.freedesktop.org/show_bug.cgi?id=93649
--- Comment #56 from Amarildo amarildo-geral@autistici.org --- Didn't work, hang is still there. I couldn't even go to tty2 this time.
amd-staging-4.7 compiled this morning
mesa-git
llvm-git
https://bugs.freedesktop.org/show_bug.cgi?id=93649
--- Comment #57 from hofmann.zachary@gmail.com --- As smoki mentioned, many of the troubles started after Valve's texture streaming changes to TF2. They'd certainly know what changed in their code, but for someone like me they're impossible to get a hold of.
http://www.teamfortress.com/post.php?id=19733
https://bugs.freedesktop.org/show_bug.cgi?id=93649
pandiculationfinch@gmail.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |pandiculationfinch@gmail.com
--- Comment #58 from pandiculationfinch@gmail.com --- Created attachment 127704 --> https://bugs.freedesktop.org/attachment.cgi?id=127704&action=edit package update history that lead to a change in behaviour
Last night the freezes I've been having changed their behaviour. They used to just cause the system to completely freeze up. Now my system does an immediate shutdown.
This is interesting because I had just updated linux and mesa-git, so I potentially have a commit range in mesa/llvm which has code related to the problem. I'm going to roll back my kernel/headers tonight and reboot to rule that out. If that doesn't cause the hang to reappear, I'll roll back mesa tomorrow, and then llvm.
In the meantime I've attached the package update history for the last few days in case that helps any of the developers.
https://bugs.freedesktop.org/show_bug.cgi?id=93649
--- Comment #59 from pandiculationfinch@gmail.com --- Sigh, turns out something else must have caused the shutdowns; the game is back to just freezing the system today. =/
https://bugs.freedesktop.org/show_bug.cgi?id=93649
--- Comment #60 from Rosco P. Coltrane roscofdporg@manashort.com --- Some people are reporting that they can reproduce the bug on Windows 7.
https://github.com/ValveSoftware/Source-1-Games/issues/1943#issuecomment-260...
Are we absolutely sure that it is not a hardware problem?
https://bugs.freedesktop.org/show_bug.cgi?id=93649
--- Comment #61 from hofmann.zachary@gmail.com --- I haven't seen anything to rule out it being a hardware problem, but Valve's overwhelming silence on the matter isn't exactly helpful.
https://bugs.freedesktop.org/show_bug.cgi?id=93649
--- Comment #62 from pandiculationfinch@gmail.com --- I finally found the root cause for my problems.
Turns out my CPU was overheating. It was only stressed enough when playing games, and nothing showed up in the logs about a shutdown due to heat. Once I resolved the overheating, all my games ran smoothly with no crashes. Apologies for the noise.
Wish I had found it sooner.
https://bugs.freedesktop.org/show_bug.cgi?id=93649
--- Comment #63 from Cooper Blake the_analogkid@yahoo.com --- I am also seeing my system completely crash after running Team Fortress 2 for typically 5-20 minutes. In the last three occurrences, I've seen the following:
1. Freeze and system reboot within 10 seconds. I did not see anything in the logs.
2. Successful playing for ~30 minutes without issue.
3. Freeze and sound loop. The screen resets and sound loop changes every 10-20 seconds, which I believe is when the system is trying to reset the GPU. However, it never succeeds, and the system becomes completely non-responsive. The keyboard does not seem to accept input (num lock is frozen, can't switch to console). The only thing I can do is a hard restart. This scenario happens almost every time.
Output from journalctl looks like this:
Nov 24 21:26:42 fedora kernel: radeon 0000:01:00.0: ring 3 stalled for more than 10181msec
Nov 24 21:26:42 fedora kernel: radeon 0000:01:00.0: GPU lockup (current fence id 0x0000000000075bec last fence id 0x0000000000075bf7 on ring 3)
Backtrace starts like this:
Nov 24 21:26:42 fedora /usr/libexec/gdm-x-session[2242]: (EE) Backtrace:
Nov 24 21:26:42 fedora /usr/libexec/gdm-x-session[2242]: (EE) 0: /usr/libexec/Xorg (OsLookupColor+0x139) [0x59f679]
Nov 24 21:26:42 fedora /usr/libexec/gdm-x-session[2242]: (EE) 1: /lib64/libc.so.6 (__restore_rt+0x0) [0x7f4ec08bf7df]
Nov 24 21:26:42 fedora /usr/libexec/gdm-x-session[2242]: (EE) 2: /lib64/libc.so.6 (__memcpy_sse2_unaligned+0x29) [0x7f4ec0927739]
Nov 24 21:26:43 fedora /usr/libexec/gdm-x-session[2242]: (EE) 3: /usr/lib64/dri/radeonsi_dri.so (__driDriverGetExtensions_virtio_gpu+0x37401a) [0x7f4eb9d88e7a]
...
Nov 24 21:26:43 fedora /usr/libexec/gdm-x-session[2242]: (EE) 15: /usr/lib64/xorg/modules/libglamoregl.so (glamor_create_gc+0xa16e) [0x7f4ebafcfd3e]
Nov 24 21:26:43 fedora /usr/libexec/gdm-x-session[2242]: (EE) 16: /usr/libexec/Xorg (DamageRegionAppend+0x618) [0x520ea8]
Nov 24 21:26:43 fedora /usr/libexec/gdm-x-session[2242]: (EE) 17: /usr/lib64/xorg/modules/libglamoregl.so (glamor_create_gc+0x11427) [0x7f4ebafde9e7]
Nov 24 21:26:43 fedora /usr/libexec/gdm-x-session[2242]: (EE) 18: /usr/libexec/Xorg (AddTraps+0x56b1) [0x51c1d1]
Nov 24 21:26:43 fedora /usr/libexec/gdm-x-session[2242]: (EE) 19: /usr/libexec/Xorg (SendErrorToClient+0x2df) [0x436e2f]
Nov 24 21:26:43 fedora /usr/libexec/gdm-x-session[2242]: (EE) 20: /usr/libexec/Xorg (remove_fs_handlers+0x463) [0x43ae63]
Nov 24 21:26:43 fedora /usr/libexec/gdm-x-session[2242]: (EE) 21: /lib64/libc.so.6 (__libc_start_main+0xf1) [0x7f4ec08ab731]
Nov 24 21:26:43 fedora /usr/libexec/gdm-x-session[2242]: (EE) 22: /usr/libexec/Xorg (_start+0x29) [0x424d59]
Nov 24 21:26:43 fedora /usr/libexec/gdm-x-session[2242]: (EE) 23: ? (?+0x29) [0x29]
Nov 24 21:26:43 fedora /usr/libexec/gdm-x-session[2242]: (EE)
Nov 24 21:26:43 fedora /usr/libexec/gdm-x-session[2242]: (EE) Bus error at address 0x7f4eb5af5008
I am running Fedora 24 with the latest updates:
Hardware:
CPU: AMD Athlon II x3 450
GPU: Sapphire / AMD Radeon R7 350 w/ 2GB GDDR5
GPU chipset: Cape Verde
Software:
Kernel: 4.8.7-200.fc24.x86_64
Mesa: 12.0.3
LLVM: 3.8.0
DRM: 2.46.0
Driver: radeonsi
I have played a couple other Valve games for several hours with no problems: Portal, Portal 2, and Dota 2.
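When comparing lockup reports like the journalctl output above, it can help to pull the fence ids and ring number out of the "GPU lockup" line. The C sketch below does that with `sscanf`. The struct and function names are hypothetical. It also assumes that "current fence id" is the last fence the GPU completed and "last fence id" is the last one submitted, which is an interpretation of the message rather than something confirmed in this thread.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the fields of a radeon "GPU lockup" log line. */
struct lockup_info {
    unsigned long long current_fence; /* assumed: last fence completed */
    unsigned long long last_fence;    /* assumed: last fence submitted */
    int ring;
};

/* Parses a log line of the form shown in the comment above.
 * Returns 1 and fills *out on success, 0 if the line is not a lockup report. */
static inline int parse_radeon_lockup(const char *line, struct lockup_info *out)
{
    assert(line != NULL && out != NULL);

    /* Skip the timestamp, host name, and device prefix. */
    const char *p = strstr(line, "GPU lockup (");
    if (p == NULL)
        return 0;

    /* %llx accepts an optional 0x prefix on the hexadecimal value. */
    int fields = sscanf(p,
                        "GPU lockup (current fence id %llx last fence id %llx on ring %d)",
                        &out->current_fence, &out->last_fence, &out->ring);
    return fields == 3;
}
```

For the line reported above, this yields ring 3 and a difference of 0x75bf7 - 0x75bec = 11 between the two fence ids.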
https://bugs.freedesktop.org/show_bug.cgi?id=93649
--- Comment #64 from Amarildo amarildo-geral@autistici.org --- Have any of you tried this? https://cgit.freedesktop.org/mesa/mesa/commit/?id=6dc96de303290e8d1fc294da47...
https://bugs.freedesktop.org/show_bug.cgi?id=93649
--- Comment #65 from Marek Olšák maraeo@gmail.com --- (In reply to Amarildo from comment #27)
What I really think is that VALVe should investigate this since this problem started happening after they introduced mandatory Texture Streaming.
If you are right about texture streaming, the cso commit might fix it.
https://bugs.freedesktop.org/show_bug.cgi?id=93649
--- Comment #66 from Amarildo amarildo-geral@autistici.org --- OH MY LORD
Been playing for 25 minutes so far, no hangs at all.
I'll test more!
https://bugs.freedesktop.org/show_bug.cgi?id=93649
--- Comment #67 from Amarildo amarildo-geral@autistici.org --- 45 minutes, not a single crash. I believe it's fixed.
https://bugs.freedesktop.org/show_bug.cgi?id=93649
--- Comment #68 from Amarildo amarildo-geral@autistici.org --- Played 2 sessions of 1 hour each, no hangs at all. To me, this is fixed.
"Thanks", I guess? 1 year is still better than nothing, AMD :P
https://bugs.freedesktop.org/show_bug.cgi?id=93649
--- Comment #69 from Michel Dänzer michel@daenzer.net --- FWIW, the fundamental problem caught by Marek (good catch!) was there for almost 9 years. It just might not have had quite as severe consequences with other drivers.
https://bugs.freedesktop.org/show_bug.cgi?id=93649
--- Comment #70 from hofmann.zachary@gmail.com --- Well of course it needs more testing to be sure, but I'll probably be doing this soon.
https://bugs.freedesktop.org/show_bug.cgi?id=93649
--- Comment #71 from Amarildo amarildo-geral@autistici.org --- It would be really unfortunate if this didn't fix the issue for everybody.
https://bugs.freedesktop.org/show_bug.cgi?id=93649
--- Comment #72 from null32@airmail.cc --- RX470 here, I've been playing for more than 1 hour and no crash so far. Thank you!
https://bugs.freedesktop.org/show_bug.cgi?id=93649
--- Comment #73 from hofmann.zachary@gmail.com --- One hour is not enough testing. I applied this patch to mesa 13.0.2 and the game still locks up.
https://bugs.freedesktop.org/show_bug.cgi?id=93649
--- Comment #74 from Amarildo amarildo-geral@autistici.org --- (In reply to hofmann.zachary from comment #73)
One hour is not enough testing. I applied this patch to mesa 13.0.2 and the game still locks up.
I believe you need mesa-git and llvm-svn for it to work.
https://bugs.freedesktop.org/show_bug.cgi?id=93649
--- Comment #75 from null32@airmail.cc --- (In reply to hofmann.zachary from comment #73)
One hour is not enough testing. I applied this patch to mesa 13.0.2 and the game still locks up.
Make sure you're using a patched version of the 32 bit libraries too. I managed to play almost 3 hours in a row in a full server and in different maps without issues at all.
These are the packages that I'm using:
* linux 4.8.12-2
* linux-firmware 20161005.9c71af9-1
* mesa-git 13.1.0_devel.87233.bd56de8-1
* lib32-mesa-git 13.1.0_devel.87233.bd56de8-1
* llvm-svn 4.0.0svn_r289147-1
* lib32-llvm-svn 4.0.0svn_r289117-1
https://bugs.freedesktop.org/show_bug.cgi?id=93649
--- Comment #76 from Amarildo amarildo-geral@autistici.org --- (In reply to null32 from comment #75)
(In reply to hofmann.zachary from comment #73)
One hour is not enough testing. I applied this patch to mesa 13.0.2 and the game still locks up.
Make sure you're using a patched version of the 32 bit libraries too. I managed to play almost 3 hours in a row in a full server and in different maps without issues at all.
These are the packages that I'm using:
linux 4.8.12-2
linux-firmware 20161005.9c71af9-1
mesa-git 13.1.0_devel.87233.bd56de8-1
lib32-mesa-git 13.1.0_devel.87233.bd56de8-1
llvm-svn 4.0.0svn_r289147-1
lib32-llvm-svn 4.0.0svn_r289117-1
He confirmed it working :D
https://github.com/ValveSoftware/Source-1-Games/issues/1943#issuecomment-266...
https://bugs.freedesktop.org/show_bug.cgi?id=93649
--- Comment #77 from hofmann.zachary@gmail.com --- Oops, forgot to confirm the patch working here too. Yes, the game works without crashing now.
https://bugs.freedesktop.org/show_bug.cgi?id=93649
Marek Olšák maraeo@gmail.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|REOPENED                    |RESOLVED
         Resolution|---                         |FIXED
--- Comment #78 from Marek Olšák maraeo@gmail.com --- Fixed by: https://cgit.freedesktop.org/mesa/mesa/commit/?id=6dc96de303290e8d1fc294da47...
Closing.
https://bugs.freedesktop.org/show_bug.cgi?id=93649
--- Comment #79 from Timothy Arceri t_arceri@yahoo.com.au --- *** Bug 95308 has been marked as a duplicate of this bug. ***
https://bugs.freedesktop.org/show_bug.cgi?id=93649
--- Comment #80 from Amarildo amarildo-geral@autistici.org --- Uh oh. This bug may be back.
I'm back on Linux. First time playing for more than 30 mins (my little sister was playing) PC hangs.
Will test it to see whether it's this hellish bug or not.
https://bugs.freedesktop.org/show_bug.cgi?id=93649
--- Comment #81 from Alex Deucher alexdeucher@gmail.com --- (In reply to Amarildo from comment #80)
Uh oh. This bug may be back.
I'm back on Linux. First time playing for more than 30 mins (my little sister was playing) PC hangs.
Will test it to see whether it's this hellish bug or not.
Not likely to be the same issue if there is a hang. Please file a new bug report.