Here is Peter Wu's reply, which was not sent to the mailing list because I had to resend my e-mail to him after a delivery failure...
-------- Forwarded Message --------
Subject: Re: Fwd: Re: Kernel Freeze with American Megatrends BIOS
Date: Wed, 31 Aug 2016 18:08:53 +0200
From: Peter Wu peter@lekensteyn.nl
To: Roland Singer roland.singer@desertbit.com
On Wed, Aug 31, 2016 at 05:56:18PM +0200, Roland Singer wrote:
If you look at my notes.txt, you will see that _OFF always executes the same code. PGON differs. When the problem occurs, "Q0L0" somehow always reads back as non-zero and LNKS < 7.
Oh you're Lekensteyn ^^
Yes, that's me :) I wrote bbswitch, did the Optimus and PR3 ACPI support in nouveau so I am fairly certain what happens behind the scenes.
I don't have LNKS and no while loop after calling LKEN ?!
Yes that is what I said in https://www.spinics.net/lists/linux-pci/msg53694.html:
"Other affected devices have similar code, differences are small: No check for LNKS (avoids the infinite loop, but device is still off)"
I noticed the following:
- Blacklist nouveau
- Boot to GDM login manager (Wayland)
- Switch to TTY with CTRL+ALT+FN2
- Load bbswitch
- Switch off GPU
- run lspci -> no freeze
- Switch to GDM
- Login to a Wayland session (X11 won't work)
- run lspci in a GUI terminal -> system freezes
Is nouveau somehow loaded anyway? All those extra components (X11, Wayland, etc.) are unnecessary to reproduce the core problem. It occurs whenever the device is being resumed (either via DSM/_PS0 or via power resource PG00._ON).
Sorry, that was nonsense. The steps to reproduce the problem are still valid. I didn't wait long enough for it to power down...
But what's interesting:
- Blacklist nouveau
- Load bbswitch
- Power off GPU with bbswitch
- Power on GPU with bbswitch
- Run lspci
- Power off GPU with bbswitch
- Run lspci -> freeze
So setting the GPU power state with bbswitch works as expected. Powering it on is also fine. I did this a couple of times. But powering it off and then letting lspci power it on ends in a race.
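The sequence above can be sketched as a small shell script. This is only a rough sketch: it assumes bbswitch is loaded and exposes its usual /proc/acpi/bbswitch interface, that nouveau is blacklisted, and that it runs as root. The function names (gpu_power, repro) and the SETTLE_SECS delay are made up for illustration; the delay is a guess at "waiting long enough" for the power-down and is not a measured value. On an affected machine the last lspci call is expected to freeze the system, so nothing is executed unless repro is called explicitly.

```shell
#!/bin/sh
# Hypothetical helper names; BBSWITCH can be overridden for dry runs.
BBSWITCH="${BBSWITCH:-/proc/acpi/bbswitch}"

gpu_power() {
    # $1 = ON or OFF, written to the bbswitch control file
    case "$1" in
        ON|OFF) ;;
        *) echo "usage: gpu_power ON|OFF" >&2; return 2 ;;
    esac
    printf '%s\n' "$1" > "$BBSWITCH"
}

repro() {
    gpu_power OFF
    gpu_power ON
    lspci > /dev/null            # reported to work fine at this point
    gpu_power OFF
    sleep "${SETTLE_SECS:-5}"    # assumption: give the device time to power down
    lspci > /dev/null            # the freeze was reported at this step
}
```

Source the file and call repro from a TTY, preferably with a serial console or netconsole attached, since the machine may not come back.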
In some cases I also found that it does not always happen at the first try, but with nouveau it always seems to happen.
It might be that lspci not only powers the GPU on, but also triggers another PCI action which causes the race condition. Does this have something to do with your quote about the retrain bit?
That is an interesting hypothesis. Even if you invoke `lspci -s01:00.0` for example, it will always probe for all devices. So maybe interaction with its parent device (PCI root port 00:02.0) causes issues.
However I also tested without lspci before, and the problem still exists. You can trigger runtime resume via (as root):
echo on > /sys/bus/pci/devices/0000:01:00.0/power/control
Set it to "auto" to make it sleep again.
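For repeated testing, the two settings can be wrapped in a pair of helpers. This is a minimal sketch that assumes the standard sysfs layout under /sys/bus/pci/devices and the power/control and power/runtime_status attributes; the function names and the SYSFS_PCI override are made up for illustration.

```shell
#!/bin/sh
# SYSFS_PCI is overridable so the helpers can be tried against a scratch directory.
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci/devices}"

set_runtime_pm() {
    # $1 = PCI address (e.g. 0000:01:00.0), $2 = "on" (keep awake) or "auto" (allow sleep)
    case "$2" in
        on|auto) ;;
        *) echo "usage: set_runtime_pm <address> on|auto" >&2; return 2 ;;
    esac
    printf '%s\n' "$2" > "$SYSFS_PCI/$1/power/control"
}

get_runtime_status() {
    # Prints the kernel's view of the device, e.g. "active" or "suspended"
    cat "$SYSFS_PCI/$1/power/runtime_status"
}
```

For example, `set_runtime_pm 0000:01:00.0 auto`, then wait until `get_runtime_status 0000:01:00.0` reports "suspended", then `set_runtime_pm 0000:01:00.0 on` to trigger the resume without involving lspci.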