Hi Mihai,
You have a gen4.5 chipset which is known to be utterly broken for IOMMU+intel gpu. Looks like a few distros started enabling IOMMU by default (fc18 has similar issues) and we've never added the proper quirks. See https://bugzilla.kernel.org/show_bug.cgi?id=51921 for a proposed patch to fix this (i.e. automatically set intel_iommu=igfx_off for affected platforms). Testing highly welcome.
Cheers, Daniel
On Sat, Jan 19, 2013 at 12:48 AM, Mihai Moldovan ionic@ionic.de wrote:
Hi Daniel, David and everyone else,
I'm experiencing system freezes on a box using the vanilla 3.7.2 (actually down to 3.2 or something) kernel with a custom configuration.
There are two problems:
[*] related to i915 with modeset enabled; upon loading the kernel module with modeset=1, the box will instantly freeze.
[*] seemingly unrelated to i915; the box will randomly freeze without any clear indication of why and moreover no apparent trigger.
After months, nay, years of being "locked" into 3.0.2 for the random freezes and i915 problems, I started playing with the kernel again and out of sheer desperation installed the current debian testing kernel, based on 3.2.35.
From what I could see, it worked fine... no more crashes, neither when loading i915, nor randomly after some time (well, at least not for a day.)
This time out of frustration, I ripped the config file used by debian to build the kernel out of its deb package and rebuilt my (almost[1]) vanilla 3.7.2 kernel with this configuration exactly, updated via the oldconfig target and changed to include AHCI, RAID and SCSI drivers statically, so that I wouldn't need some initramfs to boot my system ... and ... with this config, I am not experiencing any i915 problems nor system freezes?!
I then tried to spot any "obvious" differences between the two config files and to "approximate" my config file to the debian config.
Comparing the dmesg output from 3.7.2 built with the slightly modified debian config to my 3.7.2 built with my config, I came across IOMMU entries which differed. My kernel config enables Intel IOMMU by default, while the debian config doesn't.
Looking up IOMMU stuff in Documentation/, I found out that IOMMU *may* have bugs with the internal graphics card and there is an option called intel_iommu=igfx_off to disable IOMMU remapping for the integrated graphics card...
I tried booting "my" kernel with intel_iommu=igfx_off and lo and behold, no more crashes when loading i915 with modeset enabled! Yay... but anyway, that's definitely a kernel bug.
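For anyone following along, the check implied here (does a given kernel command line carry intel_iommu=igfx_off?) can be sketched in user-space C. This is an illustration only: the helper name intel_iommu_has_option is made up and is not a kernel API, and the assumption that intel_iommu= accepts a comma-separated list of values is not stated anywhere in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper (not a kernel API): returns true if the command line
 * contains an "intel_iommu=" parameter whose value list includes `option`,
 * e.g. "igfx_off". Assumes values may be separated by commas.
 */
bool intel_iommu_has_option(const char *cmdline, const char *option)
{
    static const char key[] = "intel_iommu=";
    const size_t key_len = sizeof(key) - 1;

    assert(cmdline != NULL && option != NULL);

    const size_t opt_len = strlen(option);
    const char *p = cmdline;

    while (*p != '\0') {
        /* Skip whitespace between parameters. */
        p += strspn(p, " \t\n");
        size_t param_len = strcspn(p, " \t\n");

        if (param_len > key_len && strncmp(p, key, key_len) == 0) {
            const char *val = p + key_len;
            const char *end = p + param_len;

            /* Walk the comma-separated values of this one parameter. */
            while (val < end) {
                size_t len = strcspn(val, ", \t\n");
                if (len == opt_len && strncmp(val, option, len) == 0)
                    return true;
                val += len;
                if (val < end && *val == ',')
                    val++;
            }
        }
        p += param_len;
    }
    return false;
}
```

Called on the contents of /proc/cmdline with "igfx_off" as the option, this would tell whether the workaround was actually passed at boot. Note that a plain "iommu=off" is a different parameter and is deliberately not matched.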
Next, regarding the random freezes: the kernel booted with intel_iommu=igfx_off still froze randomly. It seems the random freeze issue is decoupled from the graphics issue.
Testing further, I rebooted using iommu=off and intel_iommu=off. So far, I had no random crashes, but the system uptime of XXXXREPLACEMEXXXX minutes is too small to draw conclusions yet.
Anyway, booting with both options made my USB ports unusable. Also, my PCIe and PCI WiFi cards stopped working. Seems like the kernel can't enumerate those devices due to... guess what, DMA remapping errors!
Note that the debian-config kernel with CONFIG_INTEL_IOMMU=y and CONFIG_INTEL_IOMMU_DEFAULT_ON=n did not produce such errors. Both my USB and WiFi cards have been working.
Any idea why that is?
As I'm not sure who to CC exactly, I'm adding both the i915 and Intel IOMMU maintainers Daniel and David.
I have included several files:
[*] the "debianish" config file
[*] my current config file (IOMMU still on by default)
[*] dmesg for the kernel built with the "debianish" config file
[*] dmesg for the kernel built with "my" config file, no IOMMU options passed
[*] dmesg for the kernel built with "my" config file, intel_iommu=igfx_off passed
[*] dmesg for the kernel built with "my" config file, iommu=off and intel_iommu=off passed
Hope we can squash those bugs!
Best regards,
Mihai
[1] only one "external" patch applied to ath9k, totally unrelated to the rest of the system, just changing regulatory stuff.
* On 19.01.2013 02:27 PM, Daniel Vetter wrote:
You have a gen4.5 chipset which is known to be utterly broken for IOMMU+intel gpu.
Nice description for what I'm seeing. ;)
After some more hours of uptime I'm inclined to say that "intel_iommu=off iommu=off" fixes my random freezes as well. Alas, the USB and PCI(e) problems are still around, but I could test recompiling 3.7.2 with Intel IOMMU turned off completely in the kernel config. Interestingly, my 3.0.2 kernel which worked fine for so long doesn't even *have* support for VT-d/Intel IOMMU. This could explain why I wasn't bitten by those problems on all previous versions.
[...] and we've never added the proper quirks. See https://bugzilla.kernel.org/show_bug.cgi?id=51921 for a proposed patch to fix this (i.e. automatically set intel_iommu=igfx_off for affected platforms). Testing highly welcome.
From a quick glance, I don't think this patch will work as-is, my PCI ID 2e12 is missing. I'll add it to the relevant section.
But even if it worked, I'd still have the "box freezes randomly" issue (mostly within 5 to 60 minutes of uptime). :( The only way to get rid of this is disabling Intel IOMMU as a whole via kernel parameters intel_iommu=off iommu=off.
Anyway, I'll give it a try.
Best regards,
Mihai
On Sat, Jan 19, 2013 at 5:13 PM, Mihai Moldovan ionic@ionic.de wrote:
[...] and we've never added the proper quirks. See https://bugzilla.kernel.org/show_bug.cgi?id=51921 for a proposed patch to fix this (i.e. automatically set intel_iommu=igfx_off for affected platforms). Testing highly welcome.
From a quick glance, I don't think this patch will work as-is, my PCI ID 2e12 is missing. I'll add it to the relevant section.
The quirk matches your pci host bridge, which should have id 2e10, not the gfx, which has id 2e12.
But even if it worked, I'd still have the "box freezes randomly" issue (mostly within 5 to 60 minutes of uptime). :( The only way to get rid of this is disabling Intel IOMMU as a whole via kernel parameters intel_iommu=off iommu=off.
Hm, can you try enabling the related iommu quirk:
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index b9d0911..e834395 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -4251,6 +4251,7 @@ static void quirk_iommu_rwbf(struct pci_dev *dev)
 }

 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2a40, quirk_iommu_rwbf);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2e10, quirk_iommu_rwbf);

 #define GGC 0x52
 #define GGC_MEMORY_SIZE_MASK (0xf << 8)
Cheers, Daniel
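The matching rule described in this message (the fixup is keyed on the host bridge ID 2e10, not on the graphics device ID 2e12) can be modelled with a small user-space table. This is a sketch only: struct fixup_entry, sketch_fixups and find_fixup are invented for illustration and are not the kernel's DECLARE_PCI_FIXUP_HEADER machinery. The two device IDs and the quirk name come from the patch in this message; 0x8086 is Intel's PCI vendor ID.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_VENDOR_INTEL 0x8086  /* Intel's PCI vendor ID */

/* Illustrative stand-in for one fixup declaration (not a kernel type). */
struct fixup_entry {
    uint16_t vendor;
    uint16_t device;
    const char *quirk;  /* name of the quirk function that would run */
};

/* Device IDs and quirk name as they appear in the patch. */
const struct fixup_entry sketch_fixups[] = {
    { SKETCH_VENDOR_INTEL, 0x2a40, "quirk_iommu_rwbf" },
    { SKETCH_VENDOR_INTEL, 0x2e10, "quirk_iommu_rwbf" },
};
const size_t sketch_num_fixups = sizeof(sketch_fixups) / sizeof(sketch_fixups[0]);

/*
 * Return the quirk name registered for this exact vendor/device pair,
 * or NULL if the table has no entry for it.
 */
const char *find_fixup(const struct fixup_entry *table, size_t count,
                       uint16_t vendor, uint16_t device)
{
    assert(table != NULL);

    for (size_t i = 0; i < count; i++) {
        if (table[i].vendor == vendor && table[i].device == device)
            return table[i].quirk;
    }
    return NULL;
}
```

In this model a lookup for the host bridge (8086:2e10) finds an entry, while a lookup for the graphics device (8086:2e12) returns NULL, which is why the gfx ID does not need to be added to the list.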
* On 19.01.2013 05:13 PM, Mihai Moldovan wrote:
- On 19.01.2013 02:27 PM, Daniel Vetter wrote:
You have a gen4.5 chipset which is known to be utterly broken for IOMMU+intel gpu.
Nice description for what I'm seeing. ;)
After some more hours of uptime I'm inclined to say that "intel_iommu=off iommu=off" fixes my random freezes as well. Alas, the USB and PCI(e) problems are still around, but I could test recompiling 3.7.2 with Intel IOMMU turned off completely in the kernel config. Interestingly, my 3.0.2 kernel which worked fine for so long doesn't even *have* support for VT-d/Intel IOMMU. This could explain why I wasn't bitten by those problems on all previous versions.
[...] and we've never added the proper quirks. See https://bugzilla.kernel.org/show_bug.cgi?id=51921 for a proposed patch to fix this (i.e. automatically set intel_iommu=igfx_off for affected platforms). Testing highly welcome.
From a quick glance, I don't think this patch will work as-is, my PCI ID 2e12 is missing. [...]
Which of course will work, as 2e10 is my DRAM controller as reported by lspci, sorry.
But shouldn't the "DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2eXX, quirk_iommu_rwbf);" calls rather be "DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2e00, quirk_iommu_g4x_gfx);"?
The current patch errors out on me while compiling, as quirk_iommu_rwbf is not yet defined at that place.
On Sat, Jan 19, 2013 at 5:26 PM, Mihai Moldovan ionic@ionic.de wrote:
The current patch errors out on me while compiling, as quirk_iommu_rwbf is not yet defined at that place.
Oops, attached an old patch, updated one should work better. -Daniel
Hi Daniel,
the patch does work, i.e., it turns off DMAR for the graphics card and alleviates the freezes when loading i915/kms.
However, still seeing random machine freezes with it (being consistent with the behavior I've experienced with intel_iommu=igfx_off).
The patch + forcing RWBF is working, too. Interestingly, this version didn't randomly freeze yet, after more than 5 hours of uptime! I'll leave the box running until tomorrow to make sure I did stick around long enough.
All those tested kernels were able to handle USB and PCI(e) devices.
I still have to test turning off IOMMU in general and Intel IOMMU specifically. Will probably do this tomorrow.
Thank you so far! :)
Mihai
On Sun, Jan 20, 2013 at 10:52 PM, Mihai Moldovan ionic@ionic.de wrote:
the patch does work, i.e., it turns off DMAR for the graphics card and alleviates the freezes when loading i915/kms.
However, still seeing random machine freezes with it (being consistent with the behavior I've experienced with intel_iommu=igfx_off).
Thanks for testing, I've just submitted the patch for review. It should be included in a -fixes tree soon and then get backported to stable kernels.
The patch + forcing RWBF is working, too. Interestingly, this version didn't randomly freeze yet, after more than 5 hours of uptime! I'll leave the box running until tomorrow to make sure I did stick around long enough.
All those tested kernels were able to handle USB and PCI(e) devices.
I still have to test turning off IOMMU in general and Intel IOMMU specifically. Will probably do this tomorrow.
Please let me know when this works solidly for you, so that I can put it into a real patch and also submit it for inclusion.
Thanks, Daniel
* On 20.01.2013 11:49 PM, Daniel Vetter wrote:
Thanks for testing, I've just submitted the patch for review. It should be included in a -fixes tree soon and then get backported to stable kernels.
Thanks. :)
Please let me know when this works solidly for you, so that I can put it into a real patch and also submit it for inclusion.
No freeze for >24h, I guess we can conclude the quirk does indeed fix the random freeze issue as well. :) I'm all for inclusion.
I'm also currently testing a kernel without the Intel IOMMU feature. This seems to work, too, but also disables Intel TXT and VT-d... At least not seeing USB and PCI(e) issues. I'll leave the box running for some more and will afterwards disable IOMMU as a whole to see if I hit USB and PCI(e) issues again with that combination.
Best regards,
Mihai
[resending to include all previous CC's]
* On 21.01.2013 07:11 PM, Mihai Moldovan wrote:
I'm also currently testing a kernel without the Intel IOMMU feature [CONFIG_INTEL_IOMMU=n, but CONFIG_IOMMU_SUPPORT=y]. [...] At least not seeing USB and PCI(e) issues. I'll leave the box running for some more [time] [...]
No freezes for >22h, seems to be fine.
[...] and will afterwards disable IOMMU as a whole to see if I hit USB and PCI(e) issues again with that combination.
The system seems to run stable with CONFIG_IOMMU_SUPPORT=n set, too. This is expected. However: unlike during earlier tests when I disabled IOMMU and Intel IOMMU via kernel/boot parameters, I am not seeing any DMA mapping errors.
There seems to be a difference between disabling IOMMU/Intel IOMMU statically in the kernel compared to disabling it via kernel parameter. Is this another bug?
I've attached both kernel ring buffer logs (minus the timings for easier diffing.)
[*] kern-new-iommu_off.log.bz2 disables IOMMU and Intel IOMMU via boot parameter
[*] kern-iommu_static_off.log.bz2 has CONFIG_IOMMU_SUPPORT=n set and any IOMMU support statically disabled (also consequently DMAR)
Mihai
On Tue, Jan 22, 2013 at 7:15 PM, Mihai Moldovan ionic@ionic.de wrote:
- On 21.01.2013 07:11 PM, Mihai Moldovan wrote:
I'm also currently testing a kernel without the Intel IOMMU feature [CONFIG_INTEL_IOMMU=n, but CONFIG_IOMMU_SUPPORT=y]. [...] At least not seeing USB and PCI(e) issues. I'll leave the box running for some more [time] [...]
No freezes for >22h, seems to be fine.
[...] and will afterwards disable IOMMU as a whole to see if I hit USB and PCI(e) issues again with that combination.
The system seems to run stable with CONFIG_IOMMU_SUPPORT=n set, too. This is expected. However: unlike during earlier tests when I disabled IOMMU and Intel IOMMU via kernel/boot parameters, I am not seeing any DMA mapping errors.
There seems to be a difference between disabling IOMMU/Intel IOMMU statically in the kernel compared to disabling it via kernel parameter. Is this another bug?
Behaviour should be the same for the actual dma access at the hw layer, but if you disable things at compile-time at least drm/i915 selects different paths. We've recently killed those though since it's not worth the complexity at all. But dunno why you still get dma errors, that shouldn't be possible. Maybe David has an idea.
I've attached both kernel ring buffer logs (minus the timings for easier diffing.)
[*] kern-new-iommu_off.log.bz2 disables IOMMU and Intel IOMMU via boot parameter
[*] kern-iommu_static_off.log.bz2 has CONFIG_IOMMU_SUPPORT=n set and any IOMMU support statically disabled (also consequently DMAR)
In any case I'll ping David about my 2nd quirk patch and whether that's something which makes sense and could be merged. Thanks a lot for all the testing you've done. -Daniel
dri-devel@lists.freedesktop.org