Hi Karol,
On Fri, Sep 27, 2019 at 11:53:48PM +0200, Karol Herbst wrote:
What exactly is the serious issue? I guess it's that the rescan doesn't detect the GPU, which means it's not responding to config accesses? Is there any timing component here, e.g., maybe we're missing some delay like the ones Mika is adding to the reset paths?
When I was checking up on some of the PCI registers of the bridge controller, the slot detection told me that there is no device recognized anymore. I don't know which register it was anymore, though I guess one could read it up in the SoC spec document by Intel.
My guess is, that the bridge controller fails to detect the GPU being here or actively threw it of the bus or something. But a normal system suspend/resume cycle brings the GPU back online (doing a rescan via sysfs gets the device detected again)
Can you elaborate a bit what kind of scenario the issue happens (e.g steps how it reproduces)? It was not 100% clear from the changelog. Also what the result when the failure happens?
I see there is a script that does something but unfortunately I'm not fluent in Python so can't extract the steps how the issue can be reproduced ;-)
One thing that I'm working on is that Linux PCI subsystem misses certain delays that are needed after D3cold -> D0 transition, otherwise the device and/or link may not be ready before we access it. What you are experiencing sounds similar. I wonder if you could try the following patch and see if it makes any difference?