https://bugzilla.kernel.org/show_bug.cgi?id=206299
            Bug ID: 206299
           Summary: [nouveau/xen] RTX 20XX instant reboot
           Product: Drivers
           Version: 2.5
    Kernel Version: 5.4.X
          Hardware: x86-64
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: blocking
          Priority: P1
         Component: Video(DRI - non Intel)
          Assignee: drivers_video-dri@kernel-bugs.osdl.org
          Reporter: frederic.epitre@orange.fr
        Regression: No
Created attachment 286963 --> https://bugzilla.kernel.org/attachment.cgi?id=286963&action=edit Kernel log
Hi,

On several kernels (4.19.X, 5.3.X, and the latest 5.4), I'm having an issue with an NVIDIA RTX 2080TI (also reported by another user with an RTX 2070: https://groups.google.com/forum/#!msg/qubes-devel/ozOQrOHsUBQ/XtIQsGm3DgAJ) that causes frequent instant reboots of the machine. The distribution is Qubes OS, so Xen is running under the hood. On a regular Fedora 31 live CD I did not succeed in reproducing the crash, whereas it is easily reproducible in Qubes (e.g. by rapidly and repeatedly resizing windows).
Thanks to the help of Marek Marczykowski-Górecki, I obtained the attached kernel log using netconsole.
Any help would be much appreciated.
Frédéric Pierret
Ilia Mirkin (imirkin@alum.mit.edu) changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |imirkin@alum.mit.edu
--- Comment #1 from Ilia Mirkin (imirkin@alum.mit.edu) --- Comment on attachment 286963 --> https://bugzilla.kernel.org/attachment.cgi?id=286963 Kernel log
badf5040 = bad mmio read.
Could there be some PCI situation? Can you include a full boot log?
--- Comment #2 from Frédéric Pierret (frederic.epitre@orange.fr) --- Created attachment 286967 --> https://bugzilla.kernel.org/attachment.cgi?id=286967&action=edit kernel log (dmesg)
--- Comment #3 from Frédéric Pierret (frederic.epitre@orange.fr) --- Hi Ilia, Thank you for your answer.
(In reply to Ilia Mirkin from comment #1)
> Comment on attachment 286963 [details]
> Kernel log
>
> badf5040 = bad mmio read.
>
> Could there be some PCI situation? Can you include a full boot log?

You'll find dmesg.log attached. By "PCI situation", do you mean a hardware issue? If so, the card works normally under Windows. For your information, the GPU remains attached to dom0; it is not passed through to a domU.
--- Comment #4 from Frédéric Pierret (frederic.epitre@orange.fr) --- Hi, While debugging it I found the exception comes from gv100_disp_intr_exc_other in gv100.c because stat = 0x00001800.
I'm trying to figure out what is messed up in the 'disp' structure, going step by step and starting with a search for NULL pointers. Any advice on how to proceed?
Thank you.
--- Comment #5 from Ilia Mirkin (imirkin@alum.mit.edu) --- Your kernel log doesn't have anything too weird in it (which is good). However I did see a similar type of error with someone using coreboot (admittedly, with an MCP77 IGP). Are you using a non-original booting mechanism? Given that there's signed firmware situations going on, we can't just re-POST the GPU easily, unlike in the MCP77 case.
The mmio read failures may be a red herring -- basically we try to figure out why the error happened, and get bad mmio reads in the process. Could just be that the error handler hasn't been properly adjusted for Turing, and reads from bad places.
I'm afraid this is out of my knowledge base, sorry. Perhaps Ben will have something clever to say.
--- Comment #6 from Frédéric Pierret (frederic.epitre@orange.fr) ---
(In reply to Ilia Mirkin from comment #5)
> Your kernel log doesn't have anything too weird in it (which is good). However I did see a similar type of error with someone using coreboot (admittedly, with an MCP77 IGP). Are you using a non-original booting mechanism? Given that there's signed firmware situations going on, we can't just re-POST the GPU easily, unlike in the MCP77 case.

I'm using standard default bios (legacy mode).

> The mmio read failures may be a red herring -- basically we try to figure out why the error happened, and get bad mmio reads in the process. Could just be that the error handler hasn't been properly adjusted for Turing, and reads from bad places.
>
> I'm afraid this is out of my knowledge base, sorry. Perhaps Ben will have something clever to say.

Hope so and thank you again for your feedback.
--- Comment #7 from Frédéric Pierret (frederic.epitre@orange.fr) ---
With Marek, we think we have found the problem. In the nv50_disp_chan_mthd function, the exact NULL pointer dereference is at mthd->data[0]->mthd. More precisely, mthd->data is not NULL but mthd->data[0] seems to be.

Trying to access mthd->data[0], we get:
BUG: kernel NULL pointer dereference, address: 0000000000000010
while trying to access mthd->data[0]->mthd, we get:
BUG: kernel NULL pointer dereference, address: 0000000000000020

So this is exactly the issue. Any idea why mthd->data is valid but mthd->data[0] is not?
--- Comment #8 from Frédéric Pierret (frederic.epitre@orange.fr) --- We found more information!
The previous tests were done with these added lines:

--- a/drivers/gpu/drm/nouveau/nvkm/engine/disp/channv50.c
+++ b/drivers/gpu/drm/nouveau/nvkm/engine/disp/channv50.c
@@ -75,13 +75,25 @@ nv50_disp_chan_mthd(struct nv50_disp_chan *chan, int debug)
 	if (debug > subdev->debug)
 		return;

+	nvkm_warn(subdev, "mthd: %p", mthd);
+	nvkm_warn(subdev, "mthd->data: %p", mthd->data);
+	nvkm_warn(subdev, "&mthd->data[0]: %p", &mthd->data[0]);
+	nvkm_warn(subdev, "mthd->data[0].mthd: %p", mthd->data[0].mthd);
 	for (i = 0; (list = mthd->data[i].mthd) != NULL; i++) {

which gave the following crash log:
[ 45.513617] nouveau 0000:26:00.0: disp: chid 73 stat 00001080 reason 1 [PUSHBUFFER_ERR] mthd 0200 data badf5040 code badf5040
[ 45.513633] nouveau 0000:26:00.0: disp: mthd: 00000000dfa55708
[ 45.513638] nouveau 0000:26:00.0: disp: mthd->data: 00000000858af80f
[ 45.513641] nouveau 0000:26:00.0: disp: &mthd->data[0]: 00000000858af80f
But after replacing "%p" with "%lx", the log revealed that mthd is NULL:
[ 74.753207] nouveau 0000:26:00.0: disp: chid 73 stat 00001080 reason 1 [PUSHBUFFER_ERR] mthd 0200 data badf5040 code badf5040
[ 74.753223] nouveau 0000:26:00.0: disp: mthd: 0
[ 74.753226] nouveau 0000:26:00.0: disp: mthd->data: 10
[ 74.753231] nouveau 0000:26:00.0: disp: &mthd->data[0]: 10
[ 74.753241] BUG: kernel NULL pointer dereference, address: 0000000000000020
[ 74.753244] #PF: supervisor read access in kernel mode
That gives some hints!
--- Comment #9 from Frédéric Pierret (frederic.epitre@orange.fr) --- A rather simple and temporary fix we found is to add:
diff --git a/drivers/gpu/drm/nouveau/nvkm/engine/disp/channv50.c b/drivers/gpu/drm/nouveau/nvkm/engine/disp/channv50.c
index bcf32d92ee5a..50e3539f33d2 100644
--- a/drivers/gpu/drm/nouveau/nvkm/engine/disp/channv50.c
+++ b/drivers/gpu/drm/nouveau/nvkm/engine/disp/channv50.c
@@ -74,6 +74,8 @@ nv50_disp_chan_mthd(struct nv50_disp_chan *chan, int debug)

 	if (debug > subdev->debug)
 		return;
+	if (!mthd)
+		return;

 	for (i = 0; (list = mthd->data[i].mthd) != NULL; i++) {
 		u32 base = chan->head * mthd->addr;
With that, it remains stable.
--- Comment #10 from Frédéric Pierret (frederic.epitre@orange.fr) ---
One last piece of information: each time I try to reproduce the freeze with the fix applied, I see a second message in the kernel log:

[ 814.207723] nouveau 0000:26:00.0: disp: chid 73 stat 00001080 reason 1 [PUSHBUFFER_ERR] mthd 0200 data badf5040 code badf5040
[ 814.207749] nouveau 0000:26:00.0: bus: MMIO read of 00000000 FAULT at 611390 [ IBUS ]

These two lines always appear together.