https://bugzilla.kernel.org/show_bug.cgi?id=150731
Bug ID: 150731
Summary: amdgpu: segfault on unbind in sysfs; card becomes nonresponsive
Product: Drivers
Version: 2.5
Kernel Version: 4.6.4
Hardware: x86-64
OS: Linux
Tree: Mainline
Status: NEW
Severity: normal
Priority: P1
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri@kernel-bugs.osdl.org
Reporter: JimiJames.Bove@gmail.com
Regression: No
Full details here: https://www.reddit.com/r/linux_gaming/comments/4udupx/nvidiaamd_support_ques...
Summary: I'm using an R9 380. Others confirmed having this issue on the R9 285 and RX 480 (so, Tonga & Polaris 10 at least).
I can bind my video card to amdgpu, and that works. It crashes X, but when I log back in, it's properly connected and everything.
However, if I try to unbind it, after waiting for a few seconds, I get a segfault. Any subsequent attempts to do anything with that card in sysfs--trying to unbind again, trying to bind to something else, etc.--will get stuck forever, never segfaulting, because the card is not responding.
Removing the card (echo 1 > /sys/bus/pci/devices/0000:0X:00.0/remove) works, but after a rescan (echo 1 > /sys/bus/pci/rescan), the card is no longer in sysfs at all, as if it's been powered down. It can't be accessed by the system in any way after that, until the computer reboots.
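For anyone wanting to reproduce the remove/rescan step, here is a minimal sketch of it as a bash function. The function name and the SYSFS_ROOT variable are hypothetical (SYSFS_ROOT only exists so the logic can be tried against a fake directory tree), and the PCI address has to be filled in for your own card. It needs root on a real system.

```shell
#!/bin/bash
# Sketch only: remove a PCI device via sysfs, rescan the bus, and report
# whether the device node is present afterwards.
# SYSFS_ROOT is a hypothetical override; it defaults to the real /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

pci_remove_and_rescan() {
    local dev="$1"    # full PCI address, e.g. 0000:03:00.0
    local node="$SYSFS_ROOT/bus/pci/devices/$dev"

    if [ ! -e "$node/remove" ]; then
        echo "no such PCI device: $dev" >&2
        return 1
    fi

    # Same writes as the echo commands described in the report.
    echo 1 > "$node/remove" || return 1
    echo 1 > "$SYSFS_ROOT/bus/pci/rescan" || return 1

    if [ -d "$node" ]; then
        echo "$dev is present after rescan"
    else
        echo "$dev did not come back after rescan" >&2
        return 2
    fi
}

# Usage (as root): pci_remove_and_rescan 0000:03:00.0
```

On an affected system the second branch is the one I would expect to hit after a failed unbind, since the card no longer shows up in sysfs.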
It may or may not be related to the "reset issues" bug:
http://vfio.blogspot.de/2015/04/progress-on-amd-front.html
https://lists.gnu.org/archive/html/qemu-devel/2015-04/msg03128.html
That bug officially only affects Hawaii and Bonaire, but Tonga cards (380, 285) exhibit the same behavior, even if it may not be for the same reason. Whether it affects Polaris 10 (RX 480) is unknown. The RX 480 tester is currently finding that out.
I also had this issue on 4.6.1, so it probably at least affects 4.6 in general. Maybe all kernel versions that have amdgpu?
--- Comment #1 from Jimi (JimiJames.Bove@gmail.com) ---
Clarification: X doesn't crash at the moment I bind the card to amdgpu, because I never actually get to bind it to amdgpu myself. X crashes when I rescan PCI devices (after unbinding the card from whatever it was originally bound to, which in my case is always vfio-pci because I pass it into a QEMU/KVM virtual machine). When I log back in, the card has been automatically bound to amdgpu successfully.
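To see which driver the card ends up bound to before and after a rescan, the `driver` symlink in the device's sysfs directory can be read. This is just a sketch; the function name and SYSFS_ROOT are hypothetical, with SYSFS_ROOT defaulting to the real /sys.

```shell
#!/bin/bash
# Sketch only: print the name of the driver a PCI device is bound to,
# or "none" if the device has no driver symlink.
# SYSFS_ROOT is a hypothetical override for trying this on a fake tree.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

pci_current_driver() {
    local dev="$1"    # full PCI address, e.g. 0000:03:00.0
    local link="$SYSFS_ROOT/bus/pci/devices/$dev/driver"

    if [ -L "$link" ]; then
        # The symlink points at .../bus/pci/drivers/<name>.
        basename "$(readlink "$link")"
    else
        echo "none"
    fi
}

# Usage: pci_current_driver 0000:03:00.0
```

Running it right after unbinding from vfio-pci and again after the rescan should show the switch described above, assuming the card gets picked up by amdgpu automatically.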
--- Comment #2 from Jimi (JimiJames.Bove@gmail.com) ---
Another clarification: the behavior is the same if I don't bind the card to amdgpu myself and let it be bound to amdgpu on boot, automatically, which is how I usually test it.
--- Comment #3 from Jimi (JimiJames.Bove@gmail.com) ---
I've now confirmed this issue on Fiji (R9 Fury) as well.
--- Comment #4 from Jimi (JimiJames.Bove@gmail.com) ---
Created attachment 228411
  --> https://bugzilla.kernel.org/attachment.cgi?id=228411&action=edit
X crash log
Here are my Xorg logs for when I unbind from vfio-pci, remove, and rescan, and X crashes and comes back with the card bound. The post-crash log is the Xorg.0.log file, which just shows X loading a desktop that uses both cards (although the AMD card has "Ignore" set to "true" since it's just meant for running games with the DRI_PRIME variable), and the crash log is the Xorg.0.log.old file, which captures the moment of the crash starting at time [456.336].
You can see there aren't any errors in there. It seems to be just reconfiguring the graphics because it noticed a new available card, and somehow that resulted in me being booted back to the login screen. And according to all the tutorials I've read on switching a card between vfio-pci and X, it shouldn't even be doing this on its own. It should be waiting for me to bind the card to amdgpu myself. Why is it doing it automatically and booting me out?
--- Comment #5 from Jimi (JimiJames.Bove@gmail.com) ---
Created attachment 228421
  --> https://bugzilla.kernel.org/attachment.cgi?id=228421&action=edit
X post-crash log
--- Comment #6 from Jimi (JimiJames.Bove@gmail.com) ---
Created attachment 228431
  --> https://bugzilla.kernel.org/attachment.cgi?id=228431&action=edit
dmesg log
And now I've tried unbinding it from amdgpu without X running at all, and it of course didn't work, confirming its kernel bug status. I've captured the dmesg log, and as far as I can tell, the part of the log that pertains to amdgpu is the stack trace starting at [1131.985756].
--- Comment #7 from Jimi (JimiJames.Bove@gmail.com) ---
I should mention at this point, I think there are 2 different bugs going on. One bug is making it impossible to unbind any cards from the driver, and another bug is making X immediately bind itself to an amdgpu card the instant it becomes available and crash. The former is definitely something wrong with amdgpu in the kernel, but the latter could be X's fault; I don't know. Just in case, I've filed a report for X, too: https://bugs.freedesktop.org/show_bug.cgi?id=97313
--- Comment #8 from Jimi (JimiJames.Bove@gmail.com) ---
Created attachment 228441
  --> https://bugzilla.kernel.org/attachment.cgi?id=228441&action=edit
dmesg log (amdgpu-pro)
And here's the dmesg log from testing this with amdgpu-pro (without X running), with the crash starting at [137.003975].
amdgpu-pro exhibited almost exactly the same behavior. The only difference was instead of getting a segfault after a few seconds, the terminal session that unbound the card was immediately spammed with the dmesg stack trace in this attached file.
--- Comment #9 from Jimi (JimiJames.Bove@gmail.com) ---
Over at the X bug report, I've figured out that when X has AutoAddGPU turned off, it doesn't crash (meaning that bug is not related to this bug). However, the card is still automatically bound to amdgpu before I can bind it, which means 2 things:
1. That's going to be a problem when I'm able to successfully unbinding my card from amdgpu and will need to be able to respond PCI devices without it auto-binding to amdgpu, because I'll want to bind it to vfio-pci.
2. That part of the X bug is actually its own bug, which makes sense, because the card would immediately auto-bind to amdgpu if I tested things without X running.
So, as far as this bug report is concerned, we actually have 2 bugs going on: the card can't be unbound, and the card is automatically bound on a rescan, stopping the user from having a choice in which driver it gets bound to. I think these 2 issues just may be related. Maybe they even have the same cause?
--- Comment #10 from Jimi (JimiJames.Bove@gmail.com) ---
Sorry, autocorrect typos. Thing #1 is supposed to say that it's going to be a problem when I'm able to successfully unbind my card from amdgpu and will need to be able to rescan PCI devices without it auto-binding to amdgpu, because I'll want to bind it to vfio-pci.
Luke A. Guest (laguest@archeia.com) changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |laguest@archeia.com
--- Comment #11 from Luke A. Guest (laguest@archeia.com) ---
I can confirm that the OS completely hangs when unbinding R9 380 (Tonga Pro) with X running. Works fine with X off.
I have amdgpu and vfio-pci both in kernel, used the following to unbind it.
#!/bin/bash
for dev in "$@"; do
    vendor=$(cat "/sys/bus/pci/devices/$dev/vendor")
    device=$(cat "/sys/bus/pci/devices/$dev/device")
    if [ -e "/sys/bus/pci/devices/$dev/driver" ]; then
        echo "$dev" > "/sys/bus/pci/devices/$dev/driver/unbind"
    fi
    echo "$vendor $device" > /sys/bus/pci/drivers/vfio-pci/new_id
done
lspci -nnk shows:
03:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Tonga PRO [Radeon R9 285/380] [1002:6939] (rev f1)
	Subsystem: PC Partner Limited / Sapphire Technology Radeon R9 380 Nitro 4G D5 [174b:e308]
	Kernel driver in use: vfio-pci
03:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Tonga HDMI Audio [Radeon R9 285/380] [1002:aad8]
	Subsystem: PC Partner Limited / Sapphire Technology Radeon R9 285/380 HDMI Audio [174b:aad8]
	Kernel driver in use: vfio-pci
	Kernel modules: snd_hda_intel
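If it helps when comparing results across cards, the driver binding can be pulled out of `lspci -nnk` output with a small filter. This is only a sketch: the function name is hypothetical, and it assumes the usual layout where each device line starts with the slot address and the "Kernel driver in use:" line is indented beneath it.

```shell
#!/bin/bash
# Sketch only: read `lspci -nnk` output on stdin and print one
# "<slot> <driver>" line per device that has a driver bound.
lspci_drivers() {
    awk '
        /^[0-9a-f]/             { slot = $1 }        # device line: remember its slot
        /Kernel driver in use:/ { print slot, $NF }  # last field is the driver name
    '
}

# Usage: lspci -nnk | lspci_drivers
```

Devices with no "Kernel driver in use:" line are simply skipped, so an unbound card shows up by its absence.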
--- Comment #12 from Jimi (JimiJames.Bove@gmail.com) ---
Turns out this bug has been getting ignored because of an extremely obscure fact about terrible bug report website organization that I'm sure has screwed many other people in the past: https://bugzilla.kernel.org/show_bug.cgi?id=195321#c5
Thankfully, someone else posted it in the right place recently: https://bugs.freedesktop.org/show_bug.cgi?id=100399
Let's add our voices to that.