Hi!
I have been using this card for about a year, and first I would like to say thanks for the open source driver you made for it, for Big Navi, and for the Threadripper, which brought the fun back to computing.
I bought the card primarily to use as a host GPU in a VFIO-enabled multi-seat system I am building, and recently I was able (with a minor issue I managed to solve, more about that later) to pass the GPU to both Linux and Windows guests mostly flawlessly.
I do have experience in kernel development and debugging, so I am willing to test patches, etc. Any help is welcome!
So these are the issues:
1. (the biggest issue): The amdgpu driver often crashes when a display connector is plugged in.
I tested this on purpose with 'amdgpu.dc=1' by slowly plugging and unplugging a connector, waiting for the output to stabilize between each cycle, and the issue still reproduced after a dozen (or so) tries. (It only happens when I plug the connector in; it never happens when I unplug it.)
Then I unloaded the amdgpu driver and loaded it again with dc=0. This does sort of work but takes a lot of time. The dmesg output is attached (amdgpu_dc1_plug_bug.txt)
I did try to increase the number of retries in dm_helpers_read_local_edid to something silly like 1000, but no luck.
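Roughly, the change I tried was just bumping the retry count of the EDID read loop. An illustrative sketch only (the real loop lives in dm_helpers_read_local_edid() in amdgpu_dm_helpers.c and its details differ):

/* Illustrative only: a retry loop of this shape, with the count bumped. */
static struct edid *read_edid_with_retries(struct drm_connector *connector,
                                           struct i2c_adapter *ddc)
{
        int retries = 1000;     /* bumped from the stock handful of tries */
        struct edid *edid;

        do {
                edid = drm_get_edid(connector, ddc);
                if (edid)
                        return edid;    /* got a (seemingly) valid EDID */
                msleep(20);             /* give the link a moment to settle */
        } while (--retries > 0);

        return NULL;            /* still nothing after all the retries */
}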
I also tried to remove the code below the 'Abort detection for non-DP connectors if we have no EDID' comment; also no luck.
This bug pretty much makes it impossible to use the card daily as-is, since I connect/disconnect monitors often, especially due to VFIO usage.
2. I found out that running without the new DC framework (amdgpu.dc=0) solves issue 1 completely (but it costs HDMI sound: HDMI sound only works with amdgpu.dc=1).
I have been using the card like that for at least half a year and haven't had a single connector plug/unplug related crash.
Issue 2, however, is that in this mode (I haven't tried to reproduce this with amdgpu.dc=1 yet), sometimes when I unbind the amdgpu driver, amdgpu complains about a leaked connector and crashes a bit later. I haven't yet tracked down the combination of things needed to trigger this, but it has happened to me about 3 times already.
I did put a WARN_ON(1) into __drm_connector_put_safe to see which caller triggers the delayed work that frees the connector when it is too late.
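The instrumentation itself was trivial, something like the sketch below (the real function lives in drivers/gpu/drm/drm_connector.c; only the added WARN_ON matters, the rest of the body is elided here):

/* Debug sketch: make the late connector free print a full backtrace. */
static void __drm_connector_put_safe(struct drm_connector *conn)
{
        WARN_ON(1);     /* added: dump the call chain of whoever got us here */

        /* ... original reference drop / free logic unchanged ... */
}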
I attached a backtrace with the above WARN_ON and the crash (connector_leak_bug.txt). For reference, I also attached the script 'amdgpu_unbind' that I use to unbind the amdgpu driver.
3. When doing VFIO passthrough of this card, I found out that it doesn't suffer that much from the reset bug (as long as I shut down the guest in a clean manner, I can start it again). The vendor_reset module, however, makes the reset work even when I shut down the guest right in the middle of a running 3D app; I have tested this many times.
_However_, this only works if I never load the amdgpu Linux driver. Otherwise a Windows guest still boots, but all 3D apps in it crash very early.
I tried both the stock drivers that Windows auto-installs and the latest AMD workstation drivers from the AMD site.
Linux guests do work.
I found out that the amdgpu driver resizes the device BARs (I have a TRX40 platform, so I don't know whether this platform supports AMD Smart Access Memory or not, but according to lspci the device does support resizable BARs).
If I patch out amdgpu's BAR resize, then the Windows guest _does_ work regardless of whether I loaded amdgpu beforehand or not. Linux guests also still work. I haven't measured the performance impact of this.
For debugging this, I did try to hide the PCI_EXT_CAP_ID_REBAR capability from the VM, but it made no difference.
I suspect that once the GPU is reset, the BARs revert to their original sizes, but VFIO uses the sizes that are cached by the kernel, so the guest thinks the BARs are one size while they are actually another. I have no idea, though, why this does work with a Linux guest.
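For reference, the kernel's cached view of the BAR sizes (which is what VFIO reports) can be dumped from sysfs with something like the sketch below, so it can be compared before loading amdgpu, after loading it, and after a reset (the BDF on the command line is just an example):

/* bar_sizes.c - print the kernel's view of a device's BAR sizes.
 * Build: gcc -O2 -o bar_sizes bar_sizes.c
 * Usage: ./bar_sizes 0000:0c:00.0   (example BDF, use your card's address)
 */
#include <stdio.h>
#include <inttypes.h>

int main(int argc, char **argv)
{
        char path[256], line[128];
        FILE *f;
        int bar = 0;

        if (argc != 2) {
                fprintf(stderr, "usage: %s <pci-bdf>\n", argv[0]);
                return 1;
        }

        snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/resource", argv[1]);
        f = fopen(path, "r");
        if (!f) {
                perror(path);
                return 1;
        }

        /* each line is "<start> <end> <flags>"; size = end - start + 1 */
        while (fgets(line, sizeof(line), f) && bar < 6) {
                uint64_t start, end, flags;

                if (sscanf(line, "%" SCNx64 " %" SCNx64 " %" SCNx64,
                           &start, &end, &flags) == 3 && end)
                        printf("BAR%d: %" PRIu64 " MiB\n", bar,
                               (end - start + 1) >> 20);
                bar++;
        }
        fclose(f);
        return 0;
}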
I have attached the PCI config with amdgpu running, once with my patch that stops it from resizing the BARs, and once without that patch for reference (amdgpu_pciconfig_noresize.txt, amdgpu_pciconfig_resize.txt).
4. I found out that amdgpu runtime PM sometimes breaks the card if the last output is disconnected from it. I didn't debug it much (I just disabled it with amdgpu.runpm=0). I will do more debugging on this later.
Please let me know if you have any questions; don't hesitate to ask me for more information.
My setup: 3 outputs, all HDMI, connected through DP->HDMI adapters; 2 are 1080p monitors and 1 is a 1080p TV. The issues I describe above are reproducible on all the outputs.
I am running a 5.10.0 kernel with a few patches and the kvm-queue branch merged, for my day-to-day work on KVM.
You can find the exact kernel I use and its .config on https://gitlab.com/maximlevitsky/linux/-/commits/kernel-starship-5.10
Best regards, Maxim Levitsky
Hi Maxim,
I can't help with the display related stuff. Probably the best approach to get this fixed would be to open a bug report for it on the FDO tracker.
But I'm the one who implemented the resizable BAR support, and your analysis of the problem sounds about correct to me.
The reason why this works on Linux is most likely because we restore the BAR size on resume (and maybe during initial boot as well).
See this patch for reference:
commit d3252ace0bc652a1a244455556b6a549f969bf99
Author: Christian König <ckoenig.leichtzumerken@gmail.com>
Date:   Fri Jun 29 19:54:55 2018 -0500

    PCI: Restore resized BAR state on resume

    Resize BARs after resume to the expected size again.

    BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=199959
    Fixes: d6895ad39f3b ("drm/amdgpu: resize VRAM BAR for CPU access v6")
    Fixes: 276b738deb5b ("PCI: Add resizable BAR infrastructure")
    Signed-off-by: Christian König <christian.koenig@amd.com>
    Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
    CC: stable@vger.kernel.org # v4.15+
It should be trivial to add this to the reset module as well. It is most likely even completely vendor independent: since I'm not sure what a bus reset will do to this configuration, restoring it all the time should be the most defensive approach.
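A minimal sketch of that defensive approach, assuming the reset module has the struct pci_dev at hand (the vendor_reset callback is just a placeholder for whatever device specific sequence the module performs; restoring the saved config space should also bring resized BARs back):

/* Sketch: wrap the vendor specific reset with a config space save/restore,
 * which also puts any resized BARs back to their pre-reset size. */
static int reset_with_rebar_restore(struct pci_dev *pdev,
                                    int (*vendor_reset)(struct pci_dev *pdev))
{
        int ret;

        pci_save_state(pdev);           /* snapshot config space before reset */
        ret = vendor_reset(pdev);       /* the device specific reset (BACO etc.) */
        pci_restore_state(pdev);        /* rewrite config space, incl. BAR sizes */

        return ret;
}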
Let me know if you have any more questions on this.
Regards, Christian.
On Mon, 4 Jan 2021 12:34:34 +0100 Christian König christian.koenig@amd.com wrote:
Hi Maxim,
I can't help with the display related stuff. Probably best approach to get this fixes would be to open up a bug tracker for this on FDO.
But I'm the one who implemented the resizeable BAR support and your analysis of the problem sounds about correct to me.
The reason why this works on Linux is most likely because we restore the BAR size on resume (and maybe during initial boot as well).
See this patch for reference:
commit d3252ace0bc652a1a244455556b6a549f969bf99 Author: Christian König ckoenig.leichtzumerken@gmail.com Date: Fri Jun 29 19:54:55 2018 -0500
PCI: Restore resized BAR state on resume
Resize BARs after resume to the expected size again.
BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=199959 Fixes: d6895ad39f3b ("drm/amdgpu: resize VRAM BAR for CPU access v6") Fixes: 276b738deb5b ("PCI: Add resizable BAR infrastructure") Signed-off-by: Christian König christian.koenig@amd.com Signed-off-by: Bjorn Helgaas bhelgaas@google.com CC: stable@vger.kernel.org # v4.15+
It should be trivial to add this to the reset module as well. Most likely even completely vendor independent since I'm not sure what a bus reset will do to this configuration and restoring it all the time should be the most defensive approach.
Hmm, this should already be used by the bus/slot reset path:
pci_bus_restore_locked()/pci_slot_restore_locked()
  pci_dev_restore()
    pci_restore_state()
      pci_restore_rebar_state()
VFIO support for resizable BARs has been on my todo list, but I don't have access to any systems that have both a capable device and >4G decoding enabled in the BIOS. If we have a consistent view of the BAR size after the BARs are expanded, I'm not sure why it doesn't just work. FWIW, QEMU currently hides the REBAR capability from the guest because the kernel driver doesn't support emulation through config space (i.e. it's read-only, which the spec doesn't support).
AIUI, resource allocation can fail when enabling REBAR support, which is a problem if the failure occurs on the host but not the guest since we have no means via the hardware protocol to expose such a condition. Therefore the model I was considering for vfio-pci would be to simply pre-enable REBAR at the max size. It might be sufficiently safe to test BAR expansion on initialization and then allow user control, but I'm concerned that resource availability could change while already in use by the user. Thanks,
Alex
On 04.01.21 at 17:45, Alex Williamson wrote:
On Mon, 4 Jan 2021 12:34:34 +0100 Christian König christian.koenig@amd.com wrote:
Hi Maxim,
I can't help with the display related stuff. Probably best approach to get this fixes would be to open up a bug tracker for this on FDO.
But I'm the one who implemented the resizeable BAR support and your analysis of the problem sounds about correct to me.
The reason why this works on Linux is most likely because we restore the BAR size on resume (and maybe during initial boot as well).
See this patch for reference:
commit d3252ace0bc652a1a244455556b6a549f969bf99 Author: Christian König ckoenig.leichtzumerken@gmail.com Date: Fri Jun 29 19:54:55 2018 -0500
PCI: Restore resized BAR state on resume
Resize BARs after resume to the expected size again.
BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=199959 Fixes: d6895ad39f3b ("drm/amdgpu: resize VRAM BAR for CPU access v6") Fixes: 276b738deb5b ("PCI: Add resizable BAR infrastructure") Signed-off-by: Christian König christian.koenig@amd.com Signed-off-by: Bjorn Helgaas bhelgaas@google.com CC: stable@vger.kernel.org # v4.15+
It should be trivial to add this to the reset module as well. Most likely even completely vendor independent since I'm not sure what a bus reset will do to this configuration and restoring it all the time should be the most defensive approach.
Hmm, this should already be used by the bus/slot reset path:
pci_bus_restore_locked()/pci_slot_restore_locked() pci_dev_restore() pci_restore_state() pci_restore_rebar_state()
VFIO support for resizeable BARs has been on my todo list, but I don't have access to any systems that have both a capable device and >4G decoding enabled in the BIOS. If we have a consistent view of the BAR size after the BARs are expanded, I'm not sure why it doesn't just work. FWIW, QEMU currently hides the REBAR capability to the guest because the kernel driver doesn't support emulation through config space (ie. it's read-only, which the spec doesn't support).
In this case the guest shouldn't be able to change the config at all and I have no idea what's going wrong here.
AIUI, resource allocation can fail when enabling REBAR support, which is a problem if the failure occurs on the host but not the guest since we have no means via the hardware protocol to expose such a condition. Therefore the model I was considering for vfio-pci would be to simply pre-enable REBAR at the max size.
That's a rather bad idea. Our GPUs, for example, return way more than they actually need.
E.g. a Polaris usually returns 4GiB even when only 2GiB are installed, because 4GiB is just the maximum amount of RAM you can put together with the ASIC on a board.
Some devices even return a mask of all 1s even when they only need 2MiB, resulting in nearly 1TiB of wasted address space with this approach.
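For reference, the capability only advertises a bitmask of supported sizes, bit (4 + n) meaning 2^n MiB is supported; a quick decode sketch shows that a mask of all ones really does claim every size from 1 MiB up to 512 GiB:

#include <stdio.h>
#include <stdint.h>

/* Decode the "supported sizes" field of a Resizable BAR capability register. */
static void print_rebar_sizes(uint32_t rebar_cap)
{
        uint32_t sizes = (rebar_cap >> 4) & 0xfffff;    /* bits 4..23 */
        int n;

        for (n = 0; n < 20; n++)
                if (sizes & (1u << n))
                        printf("supported: %u MiB\n", 1u << n);
}

int main(void)
{
        print_rebar_sizes(0x00fffff0);  /* example: a device claiming every size */
        return 0;
}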
Regards, Christian.
It might be sufficiently safe to test BAR expansion on initialization and then allow user control, but I'm concerned that resource availability could change while already in use by the user. Thanks,
Alex
On Mon, 4 Jan 2021 18:39:33 +0100 Christian König christian.koenig@amd.com wrote:
On 04.01.21 at 17:45, Alex Williamson wrote:
On Mon, 4 Jan 2021 12:34:34 +0100 Christian König christian.koenig@amd.com wrote:
Hi Maxim,
I can't help with the display related stuff. Probably best approach to get this fixes would be to open up a bug tracker for this on FDO.
But I'm the one who implemented the resizeable BAR support and your analysis of the problem sounds about correct to me.
The reason why this works on Linux is most likely because we restore the BAR size on resume (and maybe during initial boot as well).
See this patch for reference:
commit d3252ace0bc652a1a244455556b6a549f969bf99 Author: Christian König ckoenig.leichtzumerken@gmail.com Date: Fri Jun 29 19:54:55 2018 -0500
PCI: Restore resized BAR state on resume
Resize BARs after resume to the expected size again.
BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=199959 Fixes: d6895ad39f3b ("drm/amdgpu: resize VRAM BAR for CPU access v6") Fixes: 276b738deb5b ("PCI: Add resizable BAR infrastructure") Signed-off-by: Christian König christian.koenig@amd.com Signed-off-by: Bjorn Helgaas bhelgaas@google.com CC: stable@vger.kernel.org # v4.15+
It should be trivial to add this to the reset module as well. Most likely even completely vendor independent since I'm not sure what a bus reset will do to this configuration and restoring it all the time should be the most defensive approach.
Hmm, this should already be used by the bus/slot reset path:
pci_bus_restore_locked()/pci_slot_restore_locked() pci_dev_restore() pci_restore_state() pci_restore_rebar_state()
VFIO support for resizeable BARs has been on my todo list, but I don't have access to any systems that have both a capable device and >4G decoding enabled in the BIOS. If we have a consistent view of the BAR size after the BARs are expanded, I'm not sure why it doesn't just work. FWIW, QEMU currently hides the REBAR capability to the guest because the kernel driver doesn't support emulation through config space (ie. it's read-only, which the spec doesn't support).
In this case the guest shouldn't be able to change the config at all and I have no idea what's going wrong here.
AIUI, resource allocation can fail when enabling REBAR support, which is a problem if the failure occurs on the host but not the guest since we have no means via the hardware protocol to expose such a condition. Therefore the model I was considering for vfio-pci would be to simply pre-enable REBAR at the max size.
That's a rather bad idea. See our GPUs for example return way more than they actually need.
E.g. a Polaris usually returns 4GiB even when only 2GiB are installed, because 4GiB is just the maximum amount of RAM you can put together with the ASIC on a board.
Would the driver fail or misbehave if the BAR is sized larger than the amount of memory on the card, or is the memory size determined independently of the BAR size?
Some devices even return a mask of all 1 even when they need only 2MiB, resulting in nearly 1TiB of wasted address space with this approach.
Ugh. I'm afraid to ask why a device with a 2MiB BAR would implement a REBAR capability, but I guess we really can't make any assumptions about the breadth of SKUs that ASIC might support (or sanity of the designers).
We could probe to determine the maximum size the host can support and potentially emulate the capability to remove sizes that we can't allocate, but without any ability for the device to reject a size advertised as supported via the capability protocol it makes me nervous how we can guarantee the resources are available when the user re-configures the device. That might mean we'd need to reserve the resources, up to what the host can support, regardless of what the device can actually use. I'm not sure how else to know how much to reserve without device specific code in vfio-pci. Thanks,
Alex
On 04.01.21 at 19:43, Alex Williamson wrote:
On Mon, 4 Jan 2021 18:39:33 +0100 Christian König christian.koenig@amd.com wrote:
On 04.01.21 at 17:45, Alex Williamson wrote:
On Mon, 4 Jan 2021 12:34:34 +0100 Christian König christian.koenig@amd.com wrote:
[SNIP]
That's a rather bad idea. See our GPUs for example return way more than they actually need.
E.g. a Polaris usually returns 4GiB even when only 2GiB are installed, because 4GiB is just the maximum amount of RAM you can put together with the ASIC on a board.
Would the driver fail or misbehave if the BAR is sized larger than the amount of memory on the card or is memory size determined independently of BAR size?
Uff, good question. I have no idea.
At least the Linux driver should behave well, but no idea about the Windows driver stack.
Some devices even return a mask of all 1 even when they need only 2MiB, resulting in nearly 1TiB of wasted address space with this approach.
Ugh. I'm afraid to ask why a device with a 2MiB BAR would implement a REBAR capability, but I guess we really can't make any assumptions about the breadth of SKUs that ASIC might support (or sanity of the designers).
It's a standard feature for FPGAs these days, since how much BAR space you need depends on what you load onto the FPGA, and that in turn usually only happens after the OS has already started and you fire up your development environment.
We could probe to determine the maximum size the host can support and potentially emulate the capability to remove sizes that we can't allocate, but without any ability for the device to reject a size advertised as supported via the capability protocol it makes me nervous how we can guarantee the resources are available when the user re-configures the device. That might mean we'd need to reserve the resources, up to what the host can support, regardless of what the device can actually use. I'm not sure how else to know how much to reserve without device specific code in vfio-pci. Thanks,
Well, in the FPGA case I outlined above, you don't really know how much BAR space you need until the setup is completed.
E.g. you could need one BAR with just 2MiB and another with 128GB, or two with 64GB, or... That's the reason why somebody came up with the REBAR standard in the first place.
I think I can summarize that static resizing might work for some devices like our GPUs, but it doesn't solve the problem in general.
Regards, Christian.
Alex
On Mon, 4 Jan 2021 21:13:53 +0100 Christian König christian.koenig@amd.com wrote:
On 04.01.21 at 19:43, Alex Williamson wrote:
On Mon, 4 Jan 2021 18:39:33 +0100 Christian König christian.koenig@amd.com wrote:
On 04.01.21 at 17:45, Alex Williamson wrote:
On Mon, 4 Jan 2021 12:34:34 +0100 Christian König christian.koenig@amd.com wrote:
[SNIP]
That's a rather bad idea. See our GPUs for example return way more than they actually need.
E.g. a Polaris usually returns 4GiB even when only 2GiB are installed, because 4GiB is just the maximum amount of RAM you can put together with the ASIC on a board.
Would the driver fail or misbehave if the BAR is sized larger than the amount of memory on the card or is memory size determined independently of BAR size?
Uff, good question. I have no idea.
At least the Linux driver should behave well, but no idea about the Windows driver stack.
Some devices even return a mask of all 1 even when they need only 2MiB, resulting in nearly 1TiB of wasted address space with this approach.
Ugh. I'm afraid to ask why a device with a 2MiB BAR would implement a REBAR capability, but I guess we really can't make any assumptions about the breadth of SKUs that ASIC might support (or sanity of the designers).
It's a standard feature for FPGAs these days since how much BAR you need depends on what you load on it, and that in turn usually only happens after the OS is already started and you fire up your development environment.
We could probe to determine the maximum size the host can support and potentially emulate the capability to remove sizes that we can't allocate, but without any ability for the device to reject a size advertised as supported via the capability protocol it makes me nervous how we can guarantee the resources are available when the user re-configures the device. That might mean we'd need to reserve the resources, up to what the host can support, regardless of what the device can actually use. I'm not sure how else to know how much to reserve without device specific code in vfio-pci. Thanks,
Well in the FPGA case I outlined above you don't really know how much BAR you need until the setup is completed.
E.g. you could need one BAR with just 2MiB and another with 128GB, or two with 64GB or.... That's the reason why somebody came up with the REBAR standard in the first place.
Yes, I suppose without a full bus-reset and soft-hotplug event, resizable BARs are the best way to reconfigure a device based on FPGA programming. Anyway, thanks for the insights here.
I think I can summarize that static resizing might work for some devices like our GPUs, but it doesn't solve the problem in general.
Yup, I don't have a good approach for the general case for a VM yet. We could add a sysfs or side channel mechanism to preconfigure a BAR size, but once we're dealing with a VM interacting with the REBAR capability itself, it's far too easy for the guest to create a configuration that the host might not have bus resources to support, especially if there are multiple resizable BARs under a bridge. Thanks,
Alex
On Mon, 2021-01-04 at 09:45 -0700, Alex Williamson wrote:
On Mon, 4 Jan 2021 12:34:34 +0100 Christian König christian.koenig@amd.com wrote:
Hi Maxim,
I can't help with the display related stuff. Probably best approach to get this fixes would be to open up a bug tracker for this on FDO.
But I'm the one who implemented the resizeable BAR support and your analysis of the problem sounds about correct to me.
The reason why this works on Linux is most likely because we restore the BAR size on resume (and maybe during initial boot as well).
See this patch for reference:
commit d3252ace0bc652a1a244455556b6a549f969bf99 Author: Christian König ckoenig.leichtzumerken@gmail.com Date: Fri Jun 29 19:54:55 2018 -0500
PCI: Restore resized BAR state on resume Resize BARs after resume to the expected size again. BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=199959 Fixes: d6895ad39f3b ("drm/amdgpu: resize VRAM BAR for CPU access v6") Fixes: 276b738deb5b ("PCI: Add resizable BAR infrastructure") Signed-off-by: Christian König <christian.koenig@amd.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> CC: stable@vger.kernel.org # v4.15+
Hi! Thanks for the feedback!
So I went over the QEMU code, and indeed QEMU (as opposed to the kernel, where I tried to hide PCI_EXT_CAP_ID_REBAR) does hide this PCI capability from the guest.
However, exactly as Alex mentioned, the kernel does indeed restore the REBAR state, and even with that code patched out I found that the REBAR state persists across the reset that the vendor_reset module does (BACO, I think).
Therefore the Linux guest sees the full 4G BAR and happily uses it, while the Windows guest's driver apparently has a bug when the BAR is that large.
I patched amdgpu to resize the BAR to various other sizes, and the Windows driver apparently works with a BAR of up to 2GB.
So, other than a bug in the Windows driver and the fact that VFIO doesn't support resizable BARs, there is pretty much nothing wrong here.
Since my system does support above-4G decoding and I do have a nice VFIO-friendly device that supports a resizable BAR, I volunteer to add support for this to VFIO as time and resources permit.
Also, it would be nice if it were possible either to make amdgpu (or the whole system) optionally avoid resizing BARs when a kernel command line option / module parameter is given, or, even better, to have amdgpu resize the BAR back to its original size when it is unloaded, which IMHO is the best solution for this problem.
I think I can prepare a patch to make amdgpu restore the BAR size on unload, if you think this is the right solution.
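Something along these lines is what I have in mind (only a sketch: the helper names and hook points are made up for illustration, and pci_resize_resource() takes the REBAR size encoding, i.e. log2 of the size in MiB):

#include <linux/pci.h>
#include <linux/log2.h>

/* Sketch: remember the firmware-assigned size of the VRAM BAR (BAR0) at
 * probe time and put it back on unload, so a later VFIO user sees the same
 * layout the BIOS set up.  Names and hook points are illustrative only. */
static int amdgpu_saved_bar0_size;      /* REBAR encoding: 0 = 1 MiB, 8 = 256 MiB, ... */

static void amdgpu_save_bar0_size(struct pci_dev *pdev)
{
        amdgpu_saved_bar0_size = ilog2(pci_resource_len(pdev, 0)) - 20;
}

static void amdgpu_restore_bar0_size(struct pci_dev *pdev)
{
        if (pci_resize_resource(pdev, 0, amdgpu_saved_bar0_size))
                pci_warn(pdev, "failed to restore the original BAR0 size\n");
}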
It should be trivial to add this to the reset module as well. Most likely even completely vendor independent since I'm not sure what a bus reset will do to this configuration and restoring it all the time should be the most defensive approach.
Hmm, this should already be used by the bus/slot reset path:
pci_bus_restore_locked()/pci_slot_restore_locked() pci_dev_restore() pci_restore_state() pci_restore_rebar_state()
VFIO support for resizeable BARs has been on my todo list, but I don't have access to any systems that have both a capable device and >4G decoding enabled in the BIOS. If we have a consistent view of the BAR size after the BARs are expanded, I'm not sure why it doesn't just work. FWIW, QEMU currently hides the REBAR capability to the guest because the kernel driver doesn't support emulation through config space (ie. it's read-only, which the spec doesn't support).
AIUI, resource allocation can fail when enabling REBAR support, which is a problem if the failure occurs on the host but not the guest since we have no means via the hardware protocol to expose such a condition. Therefore the model I was considering for vfio-pci would be to simply pre-enable REBAR at the max size. It might be sufficiently safe to test BAR expansion on initialization and then allow user control, but I'm concerned that resource availability could change while already in use by the user. Thanks,
As mentioned in other replies in this thread, and as was my first thought about this as well, this will indeed break on devices which don't accurately report the maximum BAR size that they actually need. Even the spec itself says that determining the optimal BAR size is vendor specific.
We could also allow the guest to resize the BAR and, if that fails, expose the error via a virtual AER message on the root port where the device is attached?
I personally don't know if this is possible/worth it.
Best regards, Maxim Levitsky
Alex
On 06.01.21 at 21:21, Maxim Levitsky wrote:
On Mon, 2021-01-04 at 09:45 -0700, Alex Williamson wrote:
On Mon, 4 Jan 2021 12:34:34 +0100 Christian König christian.koenig@amd.com wrote:
Hi Maxim,
I can't help with the display related stuff. Probably best approach to get this fixes would be to open up a bug tracker for this on FDO.
But I'm the one who implemented the resizeable BAR support and your analysis of the problem sounds about correct to me.
The reason why this works on Linux is most likely because we restore the BAR size on resume (and maybe during initial boot as well).
See this patch for reference:
commit d3252ace0bc652a1a244455556b6a549f969bf99 Author: Christian König ckoenig.leichtzumerken@gmail.com Date: Fri Jun 29 19:54:55 2018 -0500
PCI: Restore resized BAR state on resume Resize BARs after resume to the expected size again. BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=199959 Fixes: d6895ad39f3b ("drm/amdgpu: resize VRAM BAR for CPU access v6") Fixes: 276b738deb5b ("PCI: Add resizable BAR infrastructure") Signed-off-by: Christian König <christian.koenig@amd.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> CC: stable@vger.kernel.org # v4.15+
Hi! Thanks for the feedback!
So I went over qemu code and indeed the qemu (as opposed to the kernel where I tried to hide the PCI_EXT_CAP_ID_REBAR) indeed does hide this pci capability from the guest.
However exactly as Alex mentioned the kernel does indeed restore the rebar state, and even with that code patched out I found out that rebar state persists across the reset that the vendor_reset module does (BACO I think).
Therefore the Linux guest sees the full 4G bar and happily uses it, while the windows guest's driver apparently has a bug when the bar is that large.
I patched the amdgpu to resize the bar to various other sizes, and the windows driver apparently works up to a 2GB bar.
So pretty much other than a bug in the windows driver, and fact that VFIO doesn't support resizable bars there is nothing wrong here.
Since my system does support above 4G decoding and I do have a nice vfio friendly device that does support a resizable bar, I do volunteer to add support for this to VFIO as time and resources permit.
Also it would be nice if it was either possible to make amdgpu (or the whole system) optionally avoid resizing bars when a kernel command line / module param is given, or even better let the amdgpu resize the bar to its original size when it is unloaded which IMHO is the best solution for this problem.
I think I can prepare a patch to make amdgpu restore the bar size on unload if you think that this is the right solution.
Coming back to this topic now; sorry, I've been a bit busy over the last few days.
Basically I don't think that amdgpu should do anything when it quits.
What you should rather do is resize the BAR back to the BIOS default value when you trigger the device reset.
It should be trivial to add this to the reset module as well. Most likely even completely vendor independent since I'm not sure what a bus reset will do to this configuration and restoring it all the time should be the most defensive approach.
Hmm, this should already be used by the bus/slot reset path:
pci_bus_restore_locked()/pci_slot_restore_locked() pci_dev_restore() pci_restore_state() pci_restore_rebar_state()
VFIO support for resizeable BARs has been on my todo list, but I don't have access to any systems that have both a capable device and >4G decoding enabled in the BIOS. If we have a consistent view of the BAR size after the BARs are expanded, I'm not sure why it doesn't just work. FWIW, QEMU currently hides the REBAR capability to the guest because the kernel driver doesn't support emulation through config space (ie. it's read-only, which the spec doesn't support).
AIUI, resource allocation can fail when enabling REBAR support, which is a problem if the failure occurs on the host but not the guest since we have no means via the hardware protocol to expose such a condition. Therefore the model I was considering for vfio-pci would be to simply pre-enable REBAR at the max size. It might be sufficiently safe to test BAR expansion on initialization and then allow user control, but I'm concerned that resource availability could change while already in use by the user. Thanks,
As mentioned in other replies in this thread and what my first thought about this, this will indeed will break on devices which don't accurately report the maximum bar size that they actually need. Even the spec itself says that it is vendor specific to determine the optimal bar size.
We can also allow guest to resize the bar and if that fails, expose the error via a virtual AER message on the root port where the device is attached?
Sounds like it might work in theory, but I'm not an expert on KVM.
Regards, Christian.
I personally don't know if this is possible/worth it.
Best regards, Maxim Levitsky
Alex
On Mon, 2021-01-04 at 12:34 +0100, Christian König wrote:
Hi Maxim,
I can't help with the display related stuff. Probably best approach to get this fixes would be to open up a bug tracker for this on FDO.
Done, the bugs are opened:
https://gitlab.freedesktop.org/drm/amd/-/issues/1429
https://gitlab.freedesktop.org/drm/amd/-/issues/1430
About the EDID issue, there do seem to be a few open bugs about it, but what differs in my case, I think, is that the EDID failure happens only once in a while rather than always, and it seems to bring the whole device down.
Best regards, Maxim Levitsky