https://bugzilla.kernel.org/show_bug.cgi?id=188271
Bug ID: 188271
Summary: IOMMU DMAR fault with NVIDIA CUDA peer to peer
Product: Drivers
Version: 2.5
Kernel Version: 4.8.6
Hardware: x86-64
OS: Linux
Tree: Mainline
Status: NEW
Severity: normal
Priority: P1
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri@kernel-bugs.osdl.org
Reporter: vadim@sourced.tech
Regression: No
My motherboard is a Supermicro X10DRG-Q (details in the attached dmidecode output). It has two Xeon E5-2620 v4 CPUs (details in the attached lscpu output). Two Titan X 2016 GPUs are installed in PCIe slots (see the attached nvidia-smi output). After peer-to-peer access is enabled between the two cards, cudaMemcpyPeer() hangs and dmesg shows:
[16193.612535] DMAR: DRHD: handling fault status reg 602
[16193.617662] DMAR: [DMA Write] Request device [82:00.0] fault addr 387fc000c000 [fault reason 05] PTE Write access is not set
[16193.661857] DMAR: DRHD: handling fault status reg 702
[16193.666976] DMAR: [DMA Write] Request device [82:00.0] fault addr f8139000 [fault reason 05] PTE Write access is not set
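For anyone triaging similar reports, the fault lines above can be pulled apart mechanically. The following is only a minimal Python sketch: the function name parse_dmar_faults and the regular expression are illustrative assumptions, written against the exact message layout quoted in this report with one message per line, and other kernel versions may print the fields differently.

```python
import re

# Hypothetical pattern matching the "[DMA Write] Request device ..." lines
# exactly as quoted in this report; not taken from any kernel tooling.
DMAR_FAULT = re.compile(
    r"DMAR: \[DMA (?P<op>Read|Write)\] Request device \[(?P<device>[0-9a-f:.]+)\] "
    r"fault addr (?P<addr>[0-9a-f]+) \[fault reason (?P<reason>\d+)\] (?P<text>.*)"
)


def parse_dmar_faults(log_text):
    """Return one dict per DMAR fault line found in log_text."""
    faults = []
    for line in log_text.splitlines():
        match = DMAR_FAULT.search(line)
        if match is None:
            continue  # e.g. the "handling fault status reg" lines
        faults.append({
            "op": match.group("op"),
            "device": match.group("device"),
            "addr": int(match.group("addr"), 16),  # fault address is hexadecimal
            "reason": int(match.group("reason")),
            "text": match.group("text").strip(),
        })
    return faults
```

Fed the four lines above, this yields two entries, both DMA writes from device 82:00.0 with fault reason 5, which makes it easy to see that a single requester is faulting on two different addresses.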
I am using CoreOS, and all of this happens inside a Docker container run with --device /dev/nvidiactl --device /dev/nvidia0 --device /dev/nvidia1 --device /dev/nvidia-uvm --privileged --security-opt seccomp=unconfined
Adding intel_iommu=igfx_off to the kernel command line cures the problem, and peer to peer then works perfectly.
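To confirm which intel_iommu setting a given boot actually used, the kernel command line can be inspected. The snippet below is a rough Python sketch: the helper names are hypothetical, and it assumes that several values may be given comma-separated after intel_iommu=.

```python
def intel_iommu_options(cmdline):
    """Return the intel_iommu= values found in a kernel command line string."""
    prefix = "intel_iommu="
    options = []
    for token in cmdline.split():
        if token.startswith(prefix):
            # Assumption: several values may be given comma-separated.
            options.extend(opt for opt in token[len(prefix):].split(",") if opt)
    return options


def current_intel_iommu_options(path="/proc/cmdline"):
    """Read the running kernel's command line and extract the options."""
    with open(path) as cmdline_file:
        return intel_iommu_options(cmdline_file.read())
```

On the affected machine, calling current_intel_iommu_options() before and after the change should show whether the workaround is in effect for the boot being tested.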
--- Comment #1 from Vadim Markovtsev <vadim@sourced.tech> ---
Created attachment 245361
  --> https://bugzilla.kernel.org/attachment.cgi?id=245361&action=edit
dmidecode -t 2
--- Comment #2 from Vadim Markovtsev <vadim@sourced.tech> ---
Created attachment 245371
  --> https://bugzilla.kernel.org/attachment.cgi?id=245371&action=edit
lscpu
--- Comment #3 from Vadim Markovtsev <vadim@sourced.tech> ---
Created attachment 245381
  --> https://bugzilla.kernel.org/attachment.cgi?id=245381&action=edit
lspci -knnv
--- Comment #4 from Vadim Markovtsev <vadim@sourced.tech> ---
Created attachment 245391
  --> https://bugzilla.kernel.org/attachment.cgi?id=245391&action=edit
nvidia-smi topo -m
--- Comment #5 from Vadim Markovtsev <vadim@sourced.tech> ---
Created attachment 245401
  --> https://bugzilla.kernel.org/attachment.cgi?id=245401&action=edit
cat /proc/cmdline
Added intel_iommu=off
--- Comment #6 from Vadim Markovtsev <vadim@sourced.tech> ---
Created attachment 245411
  --> https://bugzilla.kernel.org/attachment.cgi?id=245411&action=edit
uname -a