still happens with your patch applied. The machine simply gets shut down.
dmesg can be found here: https://gist.githubusercontent.com/karolherbst/40eb091c7b7b33ef993525de660f1...
If there are no other things to try out, I will post the updated patch shortly.
On Mon, Sep 30, 2019 at 11:29 AM Mika Westerberg <mika.westerberg@linux.intel.com> wrote:
On Mon, Sep 30, 2019 at 11:15:48AM +0200, Karol Herbst wrote:
On Mon, Sep 30, 2019 at 10:05 AM Mika Westerberg <mika.westerberg@linux.intel.com> wrote:
Hi Karol,
On Fri, Sep 27, 2019 at 11:53:48PM +0200, Karol Herbst wrote:
What exactly is the serious issue? I guess it's that the rescan doesn't detect the GPU, which means it's not responding to config accesses? Is there any timing component here, e.g., maybe we're missing some delay like the ones Mika is adding to the reset paths?
When I was checking some of the PCI registers of the bridge controller, the slot detection told me that no device was recognized anymore. I no longer remember which register it was, though I guess one could look it up in Intel's SoC spec document.
My guess is that the bridge controller fails to detect that the GPU is there, or actively threw it off the bus, or something like that. But a normal system suspend/resume cycle brings the GPU back online (doing a rescan via sysfs gets the device detected again).
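For reference, one register that reports this kind of slot state is the Slot Status register in the bridge's PCI Express capability, where the Presence Detect State bit (bit 6) says whether the port sees a card in the slot. Whether this is the register that was actually checked is an assumption; the message does not say. A minimal Python sketch, assuming the bridge's config space is available as raw bytes (for example read as root from the device's sysfs `config` file), with `find_capability` and `presence_detected` as hypothetical helper names:

```python
import struct

PCI_STATUS = 0x06            # Status register
PCI_STATUS_CAP_LIST = 0x10   # "capabilities list present" bit
PCI_CAPABILITY_LIST = 0x34   # pointer to the first capability
PCI_CAP_ID_EXP = 0x10        # PCI Express capability ID
PCI_EXP_SLTSTA = 0x1A        # Slot Status offset inside the capability
PCI_EXP_SLTSTA_PDS = 0x0040  # Presence Detect State (bit 6)


def find_capability(config, cap_id):
    """Walk the capability list; return the capability offset or None."""
    status = struct.unpack_from("<H", config, PCI_STATUS)[0]
    if not status & PCI_STATUS_CAP_LIST:
        return None
    pos = config[PCI_CAPABILITY_LIST] & 0xFC
    seen = set()  # guards against a looping (corrupt) list
    while pos >= 0x40 and pos not in seen and pos + 2 <= len(config):
        seen.add(pos)
        if config[pos] == cap_id:
            return pos
        pos = config[pos + 1] & 0xFC
    return None


def presence_detected(config):
    """True/False from Slot Status, or None if it cannot be read.

    Only meaningful for ports that implement a slot; that check is
    left out of this sketch.
    """
    cap = find_capability(config, PCI_CAP_ID_EXP)
    if cap is None or cap + PCI_EXP_SLTSTA + 2 > len(config):
        return None
    sltsta = struct.unpack_from("<H", config, cap + PCI_EXP_SLTSTA)[0]
    return bool(sltsta & PCI_EXP_SLTSTA_PDS)
```

Note that an unprivileged read of the sysfs `config` file only returns the first 64 bytes, in which case this sketch returns None rather than guessing.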
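The sysfs rescan mentioned here can be sketched roughly as follows. This assumes the standard `/sys/bus/pci/rescan` file and root privileges; the function names and the example address `0000:01:00.0` are hypothetical and not taken from the thread.

```python
from pathlib import Path


def pci_rescan(sysfs_root="/sys"):
    """Ask the kernel to rescan all PCI buses (needs root on a real system)."""
    (Path(sysfs_root) / "bus" / "pci" / "rescan").write_text("1")


def pci_device_present(bdf, sysfs_root="/sys"):
    """Return True if the kernel currently lists the device, e.g. '0000:01:00.0'."""
    return (Path(sysfs_root) / "bus" / "pci" / "devices" / bdf).exists()


if __name__ == "__main__":
    gpu = "0000:01:00.0"  # hypothetical address; check lspci for the real one
    if not pci_device_present(gpu):
        pci_rescan()
    print(gpu, "present:", pci_device_present(gpu))
```

The `sysfs_root` parameter is only there so the helpers can be exercised against a scratch directory instead of the live system.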
Can you elaborate a bit on the scenario in which the issue happens (e.g. the steps to reproduce it)? It was not 100% clear from the changelog. Also, what is the result when the failure happens?
Yeah, I already have an updated patch in the works which also does the rework Bjorn suggested. I have not had time yet to test whether I messed it up.
I am also thinking of adding a kernel parameter to enable this workaround on demand, but not quite sure on that one yet.
Right, I think it would be good to figure out the root cause before adding any workarounds ;-) It might very well be that we are just missing something that the PCIe spec requires but that is not implemented in Linux.
I see there is a script that does something, but unfortunately I'm not fluent in Python so I can't extract the steps to reproduce the issue ;-)
One thing that I'm working on is that the Linux PCI subsystem is missing certain delays that are needed after a D3cold -> D0 transition; without them, the device and/or link may not be ready before we access it. What you are experiencing sounds similar. I wonder if you could try the following patch and see if it makes any difference?
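To illustrate the difference between "not ready yet" and "gone", here is a small user-space sketch in Python. It is not the patch referred to above, which is not reproduced here. It relies on the fact that a config read returns all ones (vendor ID 0xFFFF) when no device responds; `wait_for_device` and the `read_vendor_id` callback are hypothetical names, and the timeout values are arbitrary.

```python
import time


def wait_for_device(read_vendor_id, timeout_s=1.0, interval_s=0.01,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll until the device answers config reads or the timeout expires.

    read_vendor_id is a caller-supplied function returning the 16-bit
    vendor ID; 0xFFFF means nothing responded to the read.
    """
    deadline = clock() + timeout_s
    while True:
        if read_vendor_id() != 0xFFFF:
            return True   # device responded
        if clock() >= deadline:
            return False  # still silent after the timeout
        sleep(interval_s)
```

A device that only needs more time after power-up would eventually make this return True; a device that has dropped off the bus would keep reading 0xFFFF until the timeout.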
I think I already tried this path. The problem isn't that the device becomes accessible too late, but that it seems to fall off the bus completely. But I can retest just to be sure.
Yes, please try it and share full dmesg if/when the failure still happens.