On Mon, Aug 29, 2016 at 11:02:10AM -0500, Bjorn Helgaas wrote:
[+cc linux-acpi, linux-kernel, dri-devel]
Hi Roland,
I have no idea how to debug this problem. Are you seeing something that suggests it may be a PCI problem?
Yes, I suspect there is an ACPI and/or PCI problem, possibly device-specific. Steps to reproduce on the affected machines:
1. Load nouveau.
2. Wait for it to runtime suspend.
3. Invoke 'lspci'; this resumes the Nvidia PCI device via nouveau.
4. lspci never returns; a few moments later an AML_INFINITE_LOOP is reported.
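To check whether the device has reached the runtime-suspended state before invoking lspci, the runtime PM status can be read from sysfs. This is only a minimal sketch: the address 0000:01:00.0 is an assumed location for the Nvidia GPU (adjust it for your machine), and the helper name `runtime_status` is hypothetical. Reading this attribute should not itself resume the device, unlike lspci.

```python
import os


def runtime_status(bdf, sysfs_root="/sys/bus/pci/devices"):
    """Return the runtime PM status of a PCI device (e.g. "active" or
    "suspended"), or None if the attribute cannot be read."""
    path = os.path.join(sysfs_root, bdf, "power", "runtime_status")
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None


# Usage (the address is an assumption, not taken from the reports):
#   runtime_status("0000:01:00.0")
```

Once this reports "suspended", step 3 above would be the point where the hang is expected on an affected machine.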
If you use the external bbswitch module, the effect is the same. I have been trying to debug this for some time on nouveau with no luck. The PCI/PM D3cold patches from Mika make no difference.
Runtime resume via nouveau triggers some ACPI methods (I'll assume the Windows 8-style PR method and take the Clevo P651 as an example):
_SB.PCI0.PEG0.PG00._ON () -> _SB.PCI0.PGON (0)
Then:
    Method (PGON, 1, Serialized) {
        PION = Arg0 // note: 0 for PG00
        // ...
        If ((OSYS != 0x07DF)) {
            /* Not Windows 2015 (Windows 10), see below */
        } Else {
            LKEN (PION)
        }

        // this is the infinite loop: it tries to bring the PCIe link to
        // full speed, but fails to do so.
        While ((_SB.PCI0.PEG0.LNKS < 0x07)) {
            Local0 = 0x20
            While (Local0) {
                If ((_SB.PCI0.PEG0.LNKS < 0x07)) {
                    Stall (0x64)
                    Local0--
                } Else {
                    Break
                }
            }

            If ((Local0 == Zero)) {
                _SB.PCI0.PEG0.RTLK = One
                Stall (0x64)
            }
        }
        // ...
    }
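To make the failure mode of that loop explicit, here is a small Python model of it. This is a sketch of my reading of the AML, not firmware code: `read_lnks` is a hypothetical stand-in for the LNKS register read, and `max_outer` is a safety bound that the firmware does not have (it exists only so the model terminates).

```python
def pgon_wait_for_link(read_lnks, max_outer=None):
    """Model of the LNKS wait loop in PGON.

    Returns the number of retrain requests (RTLK = One) issued before the
    link came up, or None if the artificial max_outer bound was reached.
    """
    retrains = 0
    outer = 0
    while read_lnks() < 0x07:
        if max_outer is not None and outer >= max_outer:
            return None  # the real AML has no such exit
        outer += 1
        local0 = 0x20
        while local0:
            if read_lnks() < 0x07:
                local0 -= 1  # Stall (0x64): 100 us busy-wait, not modeled
            else:
                break
        if local0 == 0:
            retrains += 1  # _SB.PCI0.PEG0.RTLK = One
    return retrains
```

With a reader that always returns a value below 0x07, the only way out is the artificial bound, which matches the AML_INFINITE_LOOP report: the interpreter aborts the method because the firmware never would.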
Without any workaround, this piece of code is invoked:
    Method (LKEN, 1, NotSerialized) {
        Local3 = (CPEX & 0x0F) // CPEX at 0x5ff9be7f and has value 000506e3
        If ((Local3 == Zero)) {
            /* Similar to below, but with Q0L0 -> P0L0 (register 0xBC bit 6) */
        } ElseIf ((Local3 != Zero)) {
            If ((Arg0 == Zero)) {
                /* Enter L0 Activate state.
                 * (LKDS tries to enter L2, deep-energy-saving state.) */
                Q0L0 = One // register 0x249 bit 0; _SB.PCI0.OPG0.Q0L0 00:01.0
                Sleep (0x10)
                Local0 = Zero
                While (Q0L0) {
                    If ((Local0 > 0x04)) {
                        Break
                    }
                    Sleep (0x10)
                    Local0++
                }
            } Else {
                /* other cases, but we are only interested in PGON(0) */
            }
        }
    }
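As a worked example of which branch LKEN takes and how long it waits, the following sketch replays the method in Python with the observed CPEX value. The function name and the `q0l0_stuck` flag are hypothetical; the flag models the case where the Q0L0 bit never clears.

```python
def lken_wait_ms(cpex, q0l0_stuck=True):
    """Return (branch, total_sleep_ms) for a model of LKEN(0)."""
    local3 = cpex & 0x0F
    if local3 == 0:
        return ("P0L0", None)  # other branch, not modeled here
    slept = 0x10  # Sleep (0x10) right after Q0L0 = One
    local0 = 0
    while q0l0_stuck:  # While (Q0L0), with the bit never clearing
        if local0 > 0x04:
            break
        slept += 0x10  # Sleep (0x10)
        local0 += 1
    return ("Q0L0", slept)


# CPEX = 0x000506e3 gives Local3 = 3, so the Q0L0 branch is taken.
print(lken_wait_ms(0x000506E3))
```

Under this reading, LKEN itself is bounded: even with Q0L0 stuck it gives up after six 16 ms sleeps (96 ms) and returns, so the unbounded wait is the LNKS loop in PGON rather than anything inside LKEN.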
The acpi_osi="!Windows 2015" workaround will invoke this instead:
    If ((OSYS != 0x07DF)) {
        If ((PION == Zero)) {
            P0AP = Zero /* PGOF writes 3 */
            P0RM = Zero /* PGOF writes 1 */
        }

        If ((PBGE != Zero)) { /* Observed to be false (PBGE == 0) */
            If (SBDL (PION)) {
                PUAB (PION)
                CBDL = GUBC (PION)
                MBDL = GMXB (PION)
                If ((CBDL > MBDL)) {
                    CBDL = MBDL /* _SB_.PCI0.MBDL */
                }
                PDUB (PION, CBDL)
            }
        }

        If ((PION == Zero)) {
            P0LD = Zero /* Link Disable = 0, PGOF sets 1 instead. */
            P0TR = One /* Train? (PGOF does not set this). */
            TCNT = Zero
            While ((TCNT < LDLY)) { /* LDLY = 300 */
                If ((P0VC == Zero)) {
                    /* VC Negotiation Pending 0 means VC negotiation is complete. */
                    Break
                }
                Sleep (0x10)
                TCNT += 0x10 /* At most 19 iterations, sleeping for 304ms. */
            }
        }
    }
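The "at most 19 iterations, 304ms" figure in the comment can be checked by replaying the TCNT loop. This is a minimal sketch with a hypothetical function name; `p0vc_pending=True` models the worst case where VC negotiation never completes.

```python
def legacy_train_wait(ldly=300, p0vc_pending=True):
    """Return (iterations, total_sleep_ms) for a model of the TCNT loop."""
    tcnt = 0
    iterations = 0
    while tcnt < ldly:
        if not p0vc_pending:  # If ((P0VC == Zero)) Break
            break
        tcnt += 0x10  # Sleep (0x10); TCNT += 0x10
        iterations += 1
    return iterations, tcnt


print(legacy_train_wait())
```

Unlike the LNKS loop in PGON, this path is bounded by LDLY, so it cannot spin forever even if the link never trains.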
The comments above are my own interpretation based on the acpidumps I extracted from the machine. These notes and ACPI tables can be found at
https://github.com/Lekensteyn/acpi-stuff/blob/master/Clevo-P651RA/notes.txt
https://github.com/Lekensteyn/acpi-stuff/tree/master/dsl/Clevo_P651RA
Other affected devices have similar code; the differences are small:
- No check for LNKS (avoids the infinite loop, but the device is still off)
- Instead of a check for != "Windows 2015", they check for == "Windows 2009" or even for == "Windows 2009" || "Windows 2013" (Dell Inspiron 7559).
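The three OSYS checks can be compared side by side with a short sketch. Only 0x07DF for "Windows 2015" is taken from the DSDT excerpt above; the other two constants assume the same year-style encoding and are not confirmed for these machines, and the function names are hypothetical.

```python
WIN_2015 = 0x07DF  # from the DSDT excerpt above
WIN_2013 = 0x07DD  # assumption: year-style encoding
WIN_2009 = 0x07D9  # assumption: year-style encoding


def clevo_uses_old_path(osys):
    """Clevo P651RA style: anything but Windows 2015 skips LKEN."""
    return osys != WIN_2015


def win2009_variant_uses_old_path(osys):
    """Variant that checks for == "Windows 2009" only."""
    return osys == WIN_2009


def dell_7559_uses_old_path(osys):
    """Dell Inspiron 7559 style: "Windows 2009" or "Windows 2013"."""
    return osys in (WIN_2009, WIN_2013)
```

Under these assumptions, an OSYS of "Windows 2009" selects the older power-on path in all three variants, whereas merely dropping "Windows 2015" is only sufficient for the Clevo-style check.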
The tested kernels (with bbswitch or nouveau) were Linux 4.4.0, 4.6, 4.7 (nouveau + PCI/PM + nouveau PR patches). The PCIe device is something from the GTX 9xxM family in all cases.
I have a bunch of PCI config dumps from Windows and Linux, but there is nothing extraordinary in them. I also did an ACPI trace via a Checked/Debug build of Windows, but it just confirms that the ACPI method we use for the Nvidia device is the correct one.
Let me know if you need more information; I would be glad to provide it.
Kind regards,
Peter
On Tue, Aug 23, 2016 at 11:23:45AM +0200, Roland Singer wrote:
Hi,
hope somebody can help me fix this kernel problem which affects the following machines:
- Clevo P651RA (i7-6700HQ/GTX 965M, part of the P6xxRx family, which is also affected)
- MSI GE62 Apache Pro (i7-6700HQ/GTX 960M)
- Gigabyte P35V5 (i7-6700HQ/GTX 970M)
- Razer Blade 14" (2016) (i7-6700HQ/GTX 970M) (BIOS 5.11, 04/07/2016)
The kernel freezes if the graphical user session (Xorg & Wayland) is started while the discrete GPU (NVIDIA) is switched off. If the discrete GPU is switched off after the graphical session has started, then everything works as expected until the graphical session is restarted.
This problem seems to be linked to specific BIOS settings. If the computer is started with the following command line:
acpi_osi=! acpi_osi="Windows 2009"
then the kernel freeze no longer occurs. However, this required a special ACPI DSDT firmware patch for the Razer Blade 2016 laptop:
https://github.com/m4ng0squ4sh/razer_blade_14_2016_acpi_dsdt
I strongly recommend fixing this in the kernel, and I am ready to help solve this problem given some guidance.
Here is a link to the GitHub issue with further information:
https://github.com/Bumblebee-Project/Bumblebee/issues/764#issuecomment-24121...
Here is some more detailed information:
https://github.com/Lekensteyn/acpi-stuff/blob/master/Clevo-P651RA/notes.txt
Hope somebody can help.