My system is Fedora 16 x86_64 running a self-compiled vanilla kernel (.config attached for -rc5). I am seeing apparent memory corruption that started with Linux 3.2-rc5; no such corruption was noticed in 3.2-rc4. On the first instance, I eventually got a NULL pointer dereference, the screen went black, the keyboard was unresponsive, and I had to hard-reboot the machine. On the second instance, I managed to reboot the machine normally. The full message log is attached. An extract of the first WARNING:
Dec 11 00:38:47 karlalex kernel: [ 2016.175191] ------------[ cut here ]------------
Dec 11 00:38:47 karlalex kernel: [ 2016.175200] WARNING: at lib/list_debug.c:30 __list_add+0x66/0x7f()
Dec 11 00:38:47 karlalex kernel: [ 2016.175202] Hardware name: OEM
Dec 11 00:38:47 karlalex kernel: [ 2016.175204] list_add corruption. prev->next should be next (ffff8800799756c0), but was ffff88005d42dcb0. (prev=ffff88005b278eb0).
Dec 11 00:38:47 karlalex kernel: [ 2016.175206] Modules linked in: tcp_lp fuse ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat vboxpci(O) xt_CHECKSUM vboxnetadp(O) vboxnetflt(O) iptable_mangle tun bridge stp llc vboxdrv(O) lockd ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep ppdev snd_seq snd_seq_device snd_pcm snd_timer snd soundcore microcode snd_page_alloc pcspkr i2c_i801 iTCO_wdt iTCO_vendor_support r8169 mii uinput parport_pc parport floppy sunrpc i915 drm_kms_helper drm i2c_algo_bit i2c_core video [last unloaded: scsi_wait_scan]
Dec 11 00:38:47 karlalex kernel: [ 2016.175254] Pid: 1868, comm: gnome-shell Tainted: G O 3.2.0-rc5 #18
Dec 11 00:38:47 karlalex kernel: [ 2016.175256] Call Trace:
Dec 11 00:38:47 karlalex kernel: [ 2016.175263] [<ffffffff8105a3e8>] warn_slowpath_common+0x83/0x9b
Dec 11 00:38:47 karlalex kernel: [ 2016.175266] [<ffffffff8105a4a3>] warn_slowpath_fmt+0x46/0x48
Dec 11 00:38:47 karlalex kernel: [ 2016.175270] [<ffffffff810502f2>] ? get_parent_ip+0xe/0x3e
Dec 11 00:38:47 karlalex kernel: [ 2016.175273] [<ffffffff812418ef>] __list_add+0x66/0x7f
Dec 11 00:38:47 karlalex kernel: [ 2016.175290] [<ffffffffa007560f>] list_move_tail+0x27/0x2c [i915]
Dec 11 00:38:47 karlalex kernel: [ 2016.175301] [<ffffffffa00757cb>] i915_gem_retire_requests_ring+0xef/0x177 [i915]
Dec 11 00:38:47 karlalex kernel: [ 2016.175312] [<ffffffffa0076cbc>] i915_wait_request+0x401/0x447 [i915]
Dec 11 00:38:47 karlalex kernel: [ 2016.175316] [<ffffffff810502f2>] ? get_parent_ip+0xe/0x3e
Dec 11 00:38:47 karlalex kernel: [ 2016.175318] [<ffffffff810502f2>] ? get_parent_ip+0xe/0x3e
Dec 11 00:38:47 karlalex kernel: [ 2016.175321] [<ffffffff810502f2>] ? get_parent_ip+0xe/0x3e
Dec 11 00:38:47 karlalex kernel: [ 2016.175333] [<ffffffffa0076d33>] i915_gem_object_wait_rendering+0x31/0x33 [i915]
Dec 11 00:38:47 karlalex kernel: [ 2016.175344] [<ffffffffa0077db2>] i915_gem_object_set_to_gtt_domain+0x53/0xd6 [i915]
Dec 11 00:38:47 karlalex kernel: [ 2016.175355] [<ffffffffa0077ec8>] i915_gem_set_domain_ioctl+0x93/0xcb [i915]
Dec 11 00:38:47 karlalex kernel: [ 2016.175368] [<ffffffffa001d78a>] drm_ioctl+0x2bf/0x397 [drm]
Dec 11 00:38:47 karlalex kernel: [ 2016.175371] [<ffffffff811ee50b>] ? avc_has_perm_flags+0x61/0x7a
Dec 11 00:38:47 karlalex kernel: [ 2016.175383] [<ffffffffa0077e35>] ? i915_gem_object_set_to_gtt_domain+0xd6/0xd6 [i915]
Dec 11 00:38:47 karlalex kernel: [ 2016.175386] [<ffffffff811ef0a5>] ? inode_has_perm+0x32/0x34
Dec 11 00:38:47 karlalex kernel: [ 2016.175389] [<ffffffff811ef14e>] ? file_has_perm+0xa7/0xc9
Dec 11 00:38:47 karlalex kernel: [ 2016.175394] [<ffffffff8113f1bb>] do_vfs_ioctl+0x415/0x456
Dec 11 00:38:47 karlalex kernel: [ 2016.175397] [<ffffffff8113f252>] sys_ioctl+0x56/0x7c
Dec 11 00:38:47 karlalex kernel: [ 2016.175402] [<ffffffff814deb02>] system_call_fastpath+0x16/0x1b
Dec 11 00:38:47 karlalex kernel: [ 2016.175404] ---[ end trace 9a493f8550a2caf6 ]---
Dec 11 00:38:47 karlalex kernel: [ 2016.428775] ------------[ cut here ]------------
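For readers unfamiliar with this warning: the message comes from the list debugging check that runs before an element is linked into a doubly linked list. The following is only a minimal userspace sketch of that kind of check, not the kernel source. The names checked_list_add, checked_list_add_tail and demo_stale_link are hypothetical, and the struct is a stand-in for the kernel's struct list_head. It shows how a stale prev->next pointer on the tail element would produce a "prev->next should be next" report when another element is appended.

```c
#include <assert.h>
#include <stdio.h>

/* Userspace stand-in for the kernel's struct list_head (assumption). */
struct list_head {
    struct list_head *next, *prev;
};

/* An empty circular list points at itself in both directions. */
static void init_list_head(struct list_head *h)
{
    h->next = h;
    h->prev = h;
}

/*
 * Hypothetical helper: link `entry` between `prev` and `next`, first
 * checking that the two neighbours still point at each other.
 * Returns the number of failed consistency checks (0 means healthy).
 */
static int checked_list_add(struct list_head *entry,
                            struct list_head *prev,
                            struct list_head *next)
{
    int bad = 0;

    if (next->prev != prev) {
        fprintf(stderr, "list_add corruption. next->prev should be prev (%p), "
                "but was %p. (next=%p).\n",
                (void *)prev, (void *)next->prev, (void *)next);
        bad++;
    }
    if (prev->next != next) {
        fprintf(stderr, "list_add corruption. prev->next should be next (%p), "
                "but was %p. (prev=%p).\n",
                (void *)next, (void *)prev->next, (void *)prev);
        bad++;
    }

    /* The insertion itself proceeds regardless, as a warning would. */
    next->prev = entry;
    entry->next = next;
    entry->prev = prev;
    prev->next = entry;
    return bad;
}

/* Append at the tail: the new element goes between head->prev and head. */
static int checked_list_add_tail(struct list_head *entry,
                                 struct list_head *head)
{
    return checked_list_add(entry, head->prev, head);
}

/* Build a healthy two-element list, damage one link, then append again. */
static int demo_stale_link(void)
{
    struct list_head head, a, b, c, stray;

    init_list_head(&head);
    assert(checked_list_add_tail(&a, &head) == 0);
    assert(checked_list_add_tail(&b, &head) == 0);

    /* Simulated corruption: the tail's next no longer points at head. */
    b.next = &stray;

    /* Only the prev->next check fails here. */
    return checked_list_add_tail(&c, &head);
}
```

In this sketch, calling demo_stale_link() reports exactly one failed check, the prev->next one, which has the same shape as the message in the log above. It says nothing about what actually damaged the pointer in the i915 request lists.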
Both times, I had just turned on the machine and booted with -rc5. The first time, I rebooted into the stock kernel, then into -rc5 again, and I did not get any corruption. It seems that having just turned on the machine (as opposed to hard/soft rebooting) has something to do with this, but I could be wrong.
Output of lspci -v is attached.
The VirtualBox module is compiled and loaded, but was never used in either case. This module is also loaded under -rc4, where it caused no problems for me.
On Sun, 11 Dec 2011 13:36:30 -0500, Alex Villacís Lasso a_villacis@palosanto.com wrote:
The only patch in the i915 code between rc4 and rc5 is a tiny VT-d workaround fix for ILK machines, which does fiddle with how the request linked lists are managed.
Can you try reverting eb1711bb94991e93669c5a1b5f84f11be2d51ea1 and see if your problem goes away?
On Mon, 12 Dec 2011 09:51:19 -0500, Alex Villacís Lasso a_villacis@palosanto.com wrote:
Ran kernel with reverted patch for 6 hours without issues so far. Will keep testing after work (issue happens with my home machine).
Thanks much. Let me know if it's still stable this evening; I can send a revert along if you don't find any problems.
On 12/12/11 11:41, Keith Packard wrote:
I just had a severe problem, but I am not sure if the patch (or its revert) is at fault.
I was running 3.2-rc5 at home, for an hour or so, when suddenly I could not launch any programs. I tried switching to the text console, but the login program restarted itself after typing "root", without waiting for the password. I then tried to reboot, but all of the installed kernels (even the stock Fedora ones) issued a kernel panic very early in the boot sequence, mentioning an attempt to kill init. By using a bootable USB stick, I could check the logs, which showed many segfaults at /lib64/ld-2.14.90.so . Even though running fsck -f on my / and /boot partitions (both ext4) showed no errors besides an unclean shutdown, the kernel panics on boot persisted. I eventually reinstalled the system from scratch, and kept my /home partition so that no important data was lost. I am still in the process of restoring my package list to the state before the crash.
Maybe the list corruption I experienced earlier was secondary damage, which now spread somewhere else and corrupted system files, but I do not have enough data to check this.
On Tue, 13 Dec 2011 10:14:15 -0500, Alex Villacís Lasso a_villacis@palosanto.com wrote:
By using a bootable USB stick, I could check the logs, which showed many segfaults at /lib64/ld-2.14.90.so .
Ouch!
Please let me know if you find anything further; I'd like to get a revert sent upstream in the next day or so.
On Tue, Dec 13, 2011 at 10:14:46AM -0800, Keith Packard wrote:
I think the revert is trtd. But if you revert it, please also revert/disable the ilk vt-d workaround or apply one of Ben's patches, because that one _does_ blow up, too. -Daniel
On Tue, 13 Dec 2011 19:26:50 +0100, Daniel Vetter daniel@ffwll.ch wrote:
Only if VT-d is enabled though, and that patch is now old enough that reverting it may cause additional problems.
Ben's patches still appear to have problems -- they don't appear to resolve the infinite recursion issue for unknown reasons.
I'm going to revert the patch which causes the reported regression, then wait for Eric to finish up his request queue cleanups and revisit this problem after that.
On Tue, Dec 13, 2011 at 10:14:46AM -0800, Keith Packard wrote:
Another patch to try is "drm/i915: Only clear the GPU domains upon a successful finish" from the my-next branch available at:
http://cgit.freedesktop.org/~danvet/drm/log/?h=my-next
Or just the raw patch:
http://cgit.freedesktop.org/~danvet/drm/patch/?id=389a55581e30607af0fcde6cdb...
Thanks, Daniel
dri-devel@lists.freedesktop.org