[Bug 206299] New: [nouveau/xen] RTX 20XX instant reboot

bugzilla-daemon＠bugzilla.kernel.org

24 Jan 2020

11:24 p.m.

https://bugzilla.kernel.org/show_bug.cgi?id=206299

Bug ID: 206299
Summary: [nouveau/xen] RTX 20XX instant reboot
Product: Drivers
Version: 2.5
Kernel Version: 5.4.X
Hardware: x86-64
OS: Linux
Tree: Mainline
Status: NEW
Severity: blocking
Priority: P1
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri@kernel-bugs.osdl.org
Reporter: frederic.epitre@orange.fr
Regression: No

Created attachment 286963
  --> https://bugzilla.kernel.org/attachment.cgi?id=286963&action=edit
Kernel log

Hi,
On several kernels (4.19.X, 5.3.X, and the latest 5.4), I'm having an issue with an NVIDIA RTX 2080TI (also reported by another user with an RTX 2070: https://groups.google.com/forum/#!msg/qubes-devel/ozOQrOHsUBQ/XtIQsGm3DgAJ) that causes many instant reboots of the machine. Specifically, the distribution is Qubes OS, so Xen is under the hood. On a standard Fedora 31 live CD I have not managed to reproduce the crash, whereas it is easily reproducible in Qubes (e.g. by intensively resizing windows).

Thanks to the help of Marek Marczykowski-Górecki, I obtained the attached kernel log using netconsole.

Any help would be much appreciated.

Frédéric Pierret

-- You are receiving this mail because: You are watching the assignee of the bug.


bugzilla-daemon＠bugzilla.kernel.org

25 Jan

12:45 a.m.

New subject: [Bug 206299] [nouveau/xen] RTX 20XX instant reboot

https://bugzilla.kernel.org/show_bug.cgi?id=206299

Ilia Mirkin (imirkin@alum.mit.edu) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |imirkin@alum.mit.edu

--- Comment #1 from Ilia Mirkin (imirkin@alum.mit.edu) ---
Comment on attachment 286963
  --> https://bugzilla.kernel.org/attachment.cgi?id=286963
Kernel log

badf5040 = bad mmio read.

Could there be some PCI situation? Can you include a full boot log?
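As a rough illustration of the "badf5040 = bad mmio read" remark, here is a minimal user-space sketch of how such a value could be recognised. It assumes that failed reads come back with a 0xbadfXXXX pattern, as the value in the log suggests; the mask, the macro names, and the helper `looks_like_bad_mmio_read` are hypothetical and are not taken from the nouveau source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* ASSUMPTION: a failed register read returns 0xbadfXXXX, as the
 * "badf5040" value in the log suggests. These constants are
 * illustrative and do not come from the nouveau driver. */
#define BAD_MMIO_MASK   0xffff0000u
#define BAD_MMIO_PREFIX 0xbadf0000u

/* Hypothetical helper: true if a 32-bit register value looks like a
 * failed MMIO read rather than real data. */
bool looks_like_bad_mmio_read(uint32_t val)
{
	return (val & BAD_MMIO_MASK) == BAD_MMIO_PREFIX;
}

/* Sanity check against the values that appear in the kernel log. */
void bad_mmio_self_check(void)
{
	assert(looks_like_bad_mmio_read(0xbadf5040u));  /* "data" and "code" fields */
	assert(!looks_like_bad_mmio_read(0x00001080u)); /* "stat" field */
}
```

With a check like this, an error handler could at least flag that the "data" and "code" fields it is about to print are not trustworthy.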

bugzilla-daemon＠bugzilla.kernel.org

9:51 a.m.

New subject: [Bug 206299] [nouveau/xen] RTX 20XX instant reboot

https://bugzilla.kernel.org/show_bug.cgi?id=206299

--- Comment #2 from Frédéric Pierret (frederic.epitre@orange.fr) ---
Created attachment 286967
  --> https://bugzilla.kernel.org/attachment.cgi?id=286967&action=edit
kernel log (dmesg)

bugzilla-daemon＠bugzilla.kernel.org

9:51 a.m.

New subject: [Bug 206299] [nouveau/xen] RTX 20XX instant reboot

https://bugzilla.kernel.org/show_bug.cgi?id=206299

--- Comment #3 from Frédéric Pierret (frederic.epitre@orange.fr) ---
Hi Ilia,
Thank you for your answer.

(In reply to Ilia Mirkin from comment #1)
> Comment on attachment 286963 [details]
> Kernel log
>
> badf5040 = bad mmio read.
>
> Could there be some PCI situation? Can you include a full boot log?

You'll find dmesg.log attached. By "PCI situation", do you mean a hardware issue? If so, the card works normally under Windows. For your information, the GPU remains attached to dom0; it is not passed through via PCI to a domU.

bugzilla-daemon＠bugzilla.kernel.org

26 Jan

3:02 p.m.

New subject: [Bug 206299] [nouveau/xen] RTX 20XX instant reboot

https://bugzilla.kernel.org/show_bug.cgi?id=206299

--- Comment #4 from Frédéric Pierret (frederic.epitre@orange.fr) ---
Hi,
While debugging, I found that the exception comes from gv100_disp_intr_exc_other in gv100.c because stat = 0x00001800.

I'm trying to figure out what is messed up in the 'disp' structure, going step by step and starting with a search for NULL pointers. Any advice on how to proceed?

Thank you.

bugzilla-daemon＠bugzilla.kernel.org

3:07 p.m.

New subject: [Bug 206299] [nouveau/xen] RTX 20XX instant reboot

https://bugzilla.kernel.org/show_bug.cgi?id=206299

--- Comment #5 from Ilia Mirkin (imirkin@alum.mit.edu) ---
Your kernel log doesn't have anything too weird in it (which is good). However I did see a similar type of error with someone using coreboot (admittedly, with an MCP77 IGP). Are you using a non-original booting mechanism? Given that there's signed firmware situations going on, we can't just re-POST the GPU easily, unlike in the MCP77 case.

The mmio read failures may be a red herring -- basically we try to figure out why the error happened, and get bad mmio reads in the process. Could just be that the error handler hasn't been properly adjusted for Turing, and reads from bad places.

I'm afraid this is out of my knowledge base, sorry. Perhaps Ben will have something clever to say.

bugzilla-daemon＠bugzilla.kernel.org

3:55 p.m.

New subject: [Bug 206299] [nouveau/xen] RTX 20XX instant reboot

https://bugzilla.kernel.org/show_bug.cgi?id=206299

--- Comment #6 from Frédéric Pierret (frederic.epitre@orange.fr) ---
(In reply to Ilia Mirkin from comment #5)
> Your kernel log doesn't have anything too weird in it (which is good). However I did see a similar type of error with someone using coreboot (admittedly, with an MCP77 IGP). Are you using a non-original booting mechanism? Given that there's signed firmware situations going on, we can't just re-POST the GPU easily, unlike in the MCP77 case.

I'm using the standard default BIOS (legacy mode).

> The mmio read failures may be a red herring -- basically we try to figure out why the error happened, and get bad mmio reads in the process. Could just be that the error handler hasn't been properly adjusted for Turing, and reads from bad places.
>
> I'm afraid this is out of my knowledge base, sorry. Perhaps Ben will have something clever to say.

Hope so and thank you again for your feedback.

bugzilla-daemon＠bugzilla.kernel.org

8:20 p.m.

New subject: [Bug 206299] [nouveau/xen] RTX 20XX instant reboot

https://bugzilla.kernel.org/show_bug.cgi?id=206299

--- Comment #7 from Frédéric Pierret (frederic.epitre@orange.fr) ---
Marek and I think we have found the problem. In the nv50_disp_chan_mthd function, the exact NULL pointer dereference is mthd->data[0]->mthd. More precisely, mthd->data is not NULL, but mthd->data[0] seems to be.

Trying to access mthd->data[0], we get:
BUG: kernel NULL pointer dereference, address: 0000000000000010
while trying to access mthd->data[0]->mthd, we get:
BUG: kernel NULL pointer dereference, address: 0000000000000020

So this is exactly the issue. Any idea why mthd->data is not NULL but mthd->data[0] is?

bugzilla-daemon＠bugzilla.kernel.org

9:45 p.m.

New subject: [Bug 206299] [nouveau/xen] RTX 20XX instant reboot

https://bugzilla.kernel.org/show_bug.cgi?id=206299

--- Comment #8 from Frédéric Pierret (frederic.epitre@orange.fr) ---
We found more information!

The previous tests were done with these added lines:

--- a/drivers/gpu/drm/nouveau/nvkm/engine/disp/channv50.c
+++ b/drivers/gpu/drm/nouveau/nvkm/engine/disp/channv50.c
@@ -75,13 +75,25 @@ nv50_disp_chan_mthd(struct nv50_disp_chan *chan, int debug)
 	if (debug > subdev->debug)
 		return;
 
+	nvkm_warn(subdev, "mthd: %p", mthd);
+	nvkm_warn(subdev, "mthd->data: %p", mthd->data);
+	nvkm_warn(subdev, "&mthd->data[0]: %p", &mthd->data[0]);
+	nvkm_warn(subdev, "mthd->data[0].mthd: %p", mthd->data[0].mthd);
 	for (i = 0; (list = mthd->data[i].mthd) != NULL; i++) {

which gave the following crash log:

[ 45.513617] nouveau 0000:26:00.0: disp: chid 73 stat 00001080 reason 1 [PUSHBUFFER_ERR] mthd 0200 data badf5040 code badf5040
[ 45.513633] nouveau 0000:26:00.0: disp: mthd: 00000000dfa55708
[ 45.513638] nouveau 0000:26:00.0: disp: mthd->data: 00000000858af80f
[ 45.513641] nouveau 0000:26:00.0: disp: &mthd->data[0]: 00000000858af80f

But replacing "%p" with "%lx" revealed that mthd is NULL:

[ 74.753207] nouveau 0000:26:00.0: disp: chid 73 stat 00001080 reason 1 [PUSHBUFFER_ERR] mthd 0200 data badf5040 code badf5040
[ 74.753223] nouveau 0000:26:00.0: disp: mthd: 0
[ 74.753226] nouveau 0000:26:00.0: disp: mthd->data: 10
[ 74.753231] nouveau 0000:26:00.0: disp: &mthd->data[0]: 10
[ 74.753241] BUG: kernel NULL pointer dereference, address: 0000000000000020
[ 74.753244] #PF: supervisor read access in kernel mode

That gives some hints!
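The addresses 0x10 and 0x20 are what one would expect if mthd itself is NULL and the faulting reads are at fixed offsets from it. The "%p" output looked non-NULL most likely because printk hashes "%p" values by default on recent kernels. The following user-space sketch reproduces the arithmetic. The struct layout is an assumption reconstructed for illustration (the names `chan_mthd`, `mthd_entry`, and `fault_address` are hypothetical), and the numbers hold for a typical 64-bit build with 8-byte pointers and 4-byte ints.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* User-space mirror of the method table. ASSUMPTION: field order and
 * types are reconstructed for illustration and may not match the
 * kernel's struct nv50_disp_chan_mthd exactly. */
struct mthd_list;                   /* opaque, only used through a pointer */

struct mthd_entry {
	const char *name;
	int nr;
	const struct mthd_list *mthd;
};

struct chan_mthd {
	const char *name;
	uint32_t addr;
	int32_t prev;
	struct mthd_entry data[];   /* flexible array member */
};

/* Address that a read of mthd->data[0].mthd would touch for a given
 * base pointer value. Computed arithmetically, nothing is dereferenced. */
uintptr_t fault_address(uintptr_t base)
{
	return base + offsetof(struct chan_mthd, data)
	            + offsetof(struct mthd_entry, mthd);
}

/* With base == 0 (mthd is NULL) the offsets match the log. */
void layout_self_check(void)
{
	assert(offsetof(struct chan_mthd, data) == 0x10); /* "&mthd->data[0]: 10" */
	assert(fault_address(0) == 0x20);                 /* "address: ...0020" */
}
```

Under these assumptions, mthd->data "not being NULL" is only an artefact: it is an array embedded in the struct, so its address is the NULL base plus 0x10.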

bugzilla-daemon＠bugzilla.kernel.org

10:02 p.m.

New subject: [Bug 206299] [nouveau/xen] RTX 20XX instant reboot

https://bugzilla.kernel.org/show_bug.cgi?id=206299

--- Comment #9 from Frédéric Pierret (frederic.epitre@orange.fr) ---
A rather simple and temporary fix we found is to add:

diff --git a/drivers/gpu/drm/nouveau/nvkm/engine/disp/channv50.c b/drivers/gpu/drm/nouveau/nvkm/engine/disp/channv50.c
index bcf32d92ee5a..50e3539f33d2 100644
--- a/drivers/gpu/drm/nouveau/nvkm/engine/disp/channv50.c
+++ b/drivers/gpu/drm/nouveau/nvkm/engine/disp/channv50.c
@@ -74,6 +74,8 @@ nv50_disp_chan_mthd(struct nv50_disp_chan *chan, int debug)
 
 	if (debug > subdev->debug)
 		return;
+	if (!mthd)
+		return;
 
 	for (i = 0; (list = mthd->data[i].mthd) != NULL; i++) {
 		u32 base = chan->head * mthd->addr;

With that, it remains stable.
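For readers unfamiliar with the pattern, the guard can be sketched in user space as follows. This is only an illustration of the early-return idea, not the driver code: the types are simplified stand-ins, the fixed-size array replaces the kernel's table, and `count_mthd_lists` is a hypothetical function.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified, hypothetical stand-ins for the driver's types. */
struct mthd_list { int dummy; };

struct mthd_entry {
	const char *name;
	const struct mthd_list *mthd;   /* NULL marks the end of the table */
};

struct chan_mthd {
	const char *name;
	struct mthd_entry data[4];
};

/* Walks the table up to the NULL sentinel and returns the number of
 * entries. Returns 0 when the channel has no method table at all,
 * which is the case the temporary fix guards against. */
int count_mthd_lists(const struct chan_mthd *mthd)
{
	int i;

	if (!mthd)
		return 0;

	for (i = 0; mthd->data[i].mthd != NULL; i++)
		;
	return i;
}

/* Usage: a table with two entries, and a channel without a table. */
void guard_self_check(void)
{
	static const struct mthd_list list = { 0 };
	static const struct chan_mthd table = {
		"core", { { "a", &list }, { "b", &list }, { NULL, NULL } }
	};

	assert(count_mthd_lists(&table) == 2);
	assert(count_mthd_lists(NULL) == 0);
}
```

The guard only avoids the crash in the debug path; it does not explain why the channel reported an error in the first place.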

bugzilla-daemon＠bugzilla.kernel.org

28 Jan

8:36 a.m.

New subject: [Bug 206299] [nouveau/xen] RTX 20XX instant reboot

https://bugzilla.kernel.org/show_bug.cgi?id=206299

--- Comment #10 from Frédéric Pierret (frederic.epitre@orange.fr) ---
One last piece of information: each time I try to reproduce the freeze, thanks to the fix I can now see a second message in the kernel log:

[ 814.207723] nouveau 0000:26:00.0: disp: chid 73 stat 00001080 reason 1 [PUSHBUFFER_ERR] mthd 0200 data badf5040 code badf5040
[ 814.207749] nouveau 0000:26:00.0: bus: MMIO read of 00000000 FAULT at 611390 [ IBUS ]

These two lines are always repeated together.
