(switched to email. Please respond via emailed reply-to-all, not via the bugzilla web interface).
On Sun, 1 Aug 2010 08:55:49 GMT bugzilla-daemon@bugzilla.kernel.org wrote:
Innocuous-looking one-liner is said to have made Milan's X server even worse than normal.
Summary: [i915] Framebuffer ID error after suspend/hibernate leading to X crash
Product: Drivers
Version: 2.5
Platform: All
OS/Version: Linux
Tree: Mainline
Status: NEW
Severity: high
Priority: P1
Component: Video(DRI - Intel)
AssignedTo: drivers_video-dri-intel@kernel-bugs.osdl.org
ReportedBy: nalimilan@club.fr
CC: chris@chris-wilson.co.uk
Regression: Yes
I've been experiencing X freezes and crashes for more than a year, and with every kernel version the cause of the bug changes. After Linus pushed 985b823b919273fe1327d56d2196b4f92e5d0fae to 2.6.35rc6 (see below [2]), I'm now getting an "invalid framebuffer id" error that kills my X server. Before that commit, I was getting an oops, which was reported in bugs.fd.o as [1].
/var/log/kern.log:
[ 1467.408347] PM: Finishing wakeup.
[ 1467.408350] Restarting tasks ... done.
[ 1467.434616] [drm:drm_mode_getfb] *ERROR* invalid framebuffer id
[ 1467.747233] sky2 0000:02:00.0: eth0: enabling interface
[...]
[ 1512.204160] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[ 1512.205452] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 11072 at 11071)
At this point, the X server is killed, and won't restart:
Fatal server error: Failed to submit batchbuffer: Input/output error
Excerpt from lspci -vnn:
00:02.1 Display controller [0380]: Intel Corporation Mobile 915GM/GMS/910GML Express Graphics Controller [8086:2792] (rev 03)
    Subsystem: Toshiba America Info Systems Device [1179:ff00]
    Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Region 0: Memory at 64000000 (32-bit, non-prefetchable) [disabled] [size=512K]
    Capabilities: [d0] Power Management version 2
        Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
        Status: D0 PME-Enable- DSel=0 DScale=0 PME-
1: https://bugs.freedesktop.org/show_bug.cgi?id=26974
2: commit 985b823b919273fe1327d56d2196b4f92e5d0fae
   Author: Linus Torvalds <torvalds@linux-foundation.org>
   Date: Fri Jul 2 10:04:42 2010 +1000

       drm/i915: fix hibernation since i915 self-reclaim fixes

       Since commit 4bdadb9785696439c6e2b3efe34aa76df1149c83 ("drm/i915: Selectively enable self-reclaim"), we've been passing GFP_MOVABLE to the i915 page allocator where we weren't before due to some over-eager removal of the page mapping gfp_flags games the code used to play.

       This caused hibernate on Intel hardware to result in a lot of memory corruptions on resume. See for example http://bugzilla.kernel.org/show_bug.cgi?id=13811
On Mon, 2 Aug 2010 16:55:03 -0700, Andrew Morton akpm@linux-foundation.org wrote:
Innocuous-looking one-liner is said to have made Milan's X server even worse than normal.
We go from a random OOPS to a consistent error (and a failing userspace). It sounds more likely that we have uncovered a real bug, probably in the ddx.
On Tue, Aug 3, 2010 at 12:25 AM, Chris Wilson chris@chris-wilson.co.uk wrote:
We go from a random OOPS to a consistent error (and a failing userspace). It sounds more likely that we have uncovered a real bug, probably in the ddx.
I can't really imagine that that one-liner made the difference. Not under any normal load. I suspect it just changes some allocation pattern very subtly, and then the memory scribble (or whatever) that really causes the bug perhaps changes.
The original oops reported in launchpad was
BUG: unable to handle kernel NULL pointer dereference at 00000108
IP: [<f8578b97>] intel_release_load_detect_pipe+0x27/0xb0 [i915]
and as far as I can tell, that's due to a load off a NULL crtc, here:
struct drm_crtc_helper_funcs *crtc_funcs = crtc->helper_private;
the disassembly is
   0:   55                      push   %ebp
   1:   89 e5                   mov    %esp,%ebp
   3:   83 ec 14                sub    $0x14,%esp
   6:   89 5d f4                mov    %ebx,-0xc(%ebp)
   9:   89 75 f8                mov    %esi,-0x8(%ebp)
   c:   89 7d fc                mov    %edi,-0x4(%ebp)
   f:   0f 1f 44 00 00          nopl   0x0(%eax,%eax,1)
  14:   8b b0 ec 02 00 00       mov    0x2ec(%eax),%esi   # crtc = encoder->crtc
  1a:   89 c3                   mov    %eax,%ebx
  1c:   8b 80 f4 02 00 00       mov    0x2f4(%eax),%eax   # dev = encoder->dev
  22:   89 d7                   mov    %edx,%edi
  24:   89 45 f0                mov    %eax,-0x10(%ebp)
  27:*  8b 8e 08 01 00 00       mov    0x108(%esi),%ecx   <-- trapping instruction (crtc_funcs = crtc->helper_private)
  2d:   80 bb 04 03 00 00 00    cmpb   $0x0,0x304(%ebx)   # intel_encoder->load_detect_temp
  34:   75 2a                   jne    0x60
  36:   0f b6 46 18             movzbl 0x18(%esi),%eax    # crtc->enabled
  3a:   84 c0                   test   %al,%al
in case anybody cares. However, I have no idea how crtc would be NULL in the first place there, it comes from
    struct drm_encoder *encoder = &intel_encoder->enc;
    ...
    struct drm_crtc *crtc = encoder->crtc;
and I don't know the setup code. It _does_ strike me that the C code does:
    ...
    struct drm_crtc *crtc = encoder->crtc;
    struct drm_encoder_helper_funcs *encoder_funcs = encoder->helper_private;
    struct drm_crtc_helper_funcs *crtc_funcs = crtc->helper_private;

    if (intel_encoder->load_detect_temp) {
        encoder->crtc = NULL;
        connector->encoder = NULL;
        ....
where I react to the fact that first we load "crtc = encoder->crtc" and dereference that pointer (crtc->helper_private) without checking whether it might be NULL, and then in some case we clear that field (encoder->crtc = NULL), so clearly the whole "encoder->crtc" field _can_ be NULL.
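To make the pattern concrete, here is a minimal userspace sketch of the kind of guard the observation above points at. It is not the i915 code and not a proposed patch: `crtc_stub`, `encoder_stub` and `release_load_detect_sketch` are hypothetical stand-ins, and the `pad` array is sized only so that `helper_private` lands at offset 0x108, matching the faulting address in the oops (a NULL base plus that field offset). The real structure layouts are not reproduced here.

```c
#include <stddef.h>

/* Hypothetical stand-ins, NOT the real DRM structures. */
struct crtc_stub {
    char pad[0x108];            /* sized so the next field sits at offset 0x108 */
    const void *helper_private;
};

struct encoder_stub {
    struct crtc_stub *crtc;     /* can be NULL: the driver itself clears it */
};

/*
 * Returns 0 if the crtc was released, -1 if the encoder had no crtc.
 * The check happens before any field of crtc is read.
 */
int release_load_detect_sketch(struct encoder_stub *encoder)
{
    struct crtc_stub *crtc = encoder->crtc;
    const void *crtc_funcs;

    if (crtc == NULL)
        return -1;              /* nothing to release, avoid the NULL load */

    crtc_funcs = crtc->helper_private;
    (void)crtc_funcs;           /* unused in this sketch */

    encoder->crtc = NULL;       /* mirrors the later "encoder->crtc = NULL" */
    return 0;
}
```

Calling the function on an encoder whose `crtc` is NULL returns -1 instead of faulting, and `offsetof(struct crtc_stub, helper_private)` evaluates to 0x108. Whether silently returning is the right behaviour in the driver, or whether a NULL crtc at that point indicates a bug elsewhere, is exactly the open question in this thread.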
However, I don't see why it should only show up for some people...
Linus