On Mon, Nov 08, 2010 at 07:43:02PM +0100, Markus Trippelsdorf wrote:
On Mon, Nov 08, 2010 at 06:07:37PM +0100, Markus Trippelsdorf wrote:
On Mon, Nov 08, 2010 at 06:02:21PM +0100, Markus Trippelsdorf wrote:
I can trigger a kernel crash on my system by simply loading this png image with firefox: http://mediaarchive.cern.ch/MediaArchive/Photo/Public/2010/1011251/1011251_0...
Sorry, the above link is wrong; this is the right one (the one that triggers the crash): http://cdsweb.cern.ch/record/1305179/files/HI-150431-630470-huge.png
I triggered it a few more times and took the attached picture. It points to the BUG() call at drivers/gpu/drm/ttm/ttm_bo.c:1628. (Sorry for the bad picture quality.)
And here is the same BUG in plain text (should be a bit easier to read):
Nov 8 19:28:23 arch kernel: ------------[ cut here ]------------
Nov 8 19:28:23 arch kernel: kernel BUG at drivers/gpu/drm/ttm/ttm_bo.c:1628!
Nov 8 19:28:23 arch kernel: invalid opcode: 0000 [#1] PREEMPT SMP
Nov 8 19:28:23 arch kernel: last sysfs file: /sys/devices/pci0000:00/0000:00:18.3/temp1_input
Nov 8 19:28:23 arch kernel: CPU 1
Nov 8 19:28:23 arch kernel: Pid: 1541, comm: X Not tainted 2.6.37-rc1-00116-g151f52f-dirty #31 M4A78T-E/System Product Name
Nov 8 19:28:23 arch kernel: RIP: 0010:[<ffffffff8121f0ff>] [<ffffffff8121f0ff>] ttm_bo_init+0x30f/0x340
Nov 8 19:28:23 arch kernel: RSP: 0018:ffff88011b0fbbe8 EFLAGS: 00010246
Nov 8 19:28:23 arch kernel: RAX: ffff8800da881778 RBX: ffff8800da881620 RCX: ffff88011b15ed78
Nov 8 19:28:23 arch kernel: RDX: ffff8800c1556040 RSI: ffff88011ff22770 RDI: 000000000017adfb
Nov 8 19:28:23 arch kernel: RBP: ffff8800da881648 R08: 0000000000000000 R09: ffff8800c1556040
Nov 8 19:28:23 arch kernel: R10: 000000000ff85205 R11: ffff8800dae19200 R12: 0000000000000001
Nov 8 19:28:23 arch kernel: R13: ffff88011ff22528 R14: ffff88011ff22778 R15: 0000000000000000
Nov 8 19:28:23 arch kernel: FS: 00007f2043043700(0000) GS:ffff8800dfc80000(0000) knlGS:0000000000000000
Nov 8 19:28:23 arch kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 8 19:28:23 arch kernel: CR2: 00007f203d057000 CR3: 000000011b12b000 CR4: 00000000000006e0
Nov 8 19:28:23 arch kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Nov 8 19:28:23 arch kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Nov 8 19:28:23 arch kernel: Process X (pid: 1541, threadinfo ffff88011b0fa000, task ffff88011c959c20)
Nov 8 19:28:23 arch kernel: Stack:
Nov 8 19:28:23 arch kernel: 0000000000000000 ffff8800da881648 ffff88011b0fbd00 ffff8800da881600
Nov 8 19:28:23 arch kernel: ffff88011ff22000 0000000000000000 0000000000000001 00000000fffffff4
Nov 8 19:28:23 arch kernel: ffff88011b0fbd00 ffffffff8125294d 0000000000000000 ffffffff00000001
Nov 8 19:28:23 arch kernel: Call Trace:
Nov 8 19:28:23 arch kernel: [<ffffffff8125294d>] ? radeon_bo_create+0x14d/0x250
Nov 8 19:28:23 arch kernel: [<ffffffff812526c0>] ? radeon_ttm_bo_destroy+0x0/0xb0
Nov 8 19:28:23 arch kernel: [<ffffffff812671cc>] ? radeon_gem_object_create+0x8c/0x130
Nov 8 19:28:23 arch kernel: [<ffffffff81267634>] ? radeon_gem_create_ioctl+0x54/0xd0
Nov 8 19:28:23 arch kernel: [<ffffffff813ab26d>] ? sock_aio_read+0x10d/0x120
Nov 8 19:28:23 arch kernel: [<ffffffff8120963c>] ? drm_ioctl+0x39c/0x450
Nov 8 19:28:23 arch kernel: [<ffffffff812675e0>] ? radeon_gem_create_ioctl+0x0/0xd0
Nov 8 19:28:23 arch kernel: [<ffffffff810dd2c9>] ? do_vfs_ioctl+0xa9/0x610
Nov 8 19:28:23 arch kernel: [<ffffffff810dd879>] ? sys_ioctl+0x49/0x80
Nov 8 19:28:23 arch kernel: [<ffffffff810ce24e>] ? sys_read+0x4e/0x90
Nov 8 19:28:23 arch kernel: [<ffffffff8102dc2b>] ? system_call_fastpath+0x16/0x1b
Nov 8 19:28:23 arch kernel: Code: e8 fb ff ff 85 c0 0f 85 68 ff ff ff 48 8b 7c 24 08 89 04 24 e8 83 d9 ff ff 8b 04 24 48 83 c4 18 5b 5d 41 5c 41 5d 41 5e 41 5f c3 <0f> 0b 48 c7 c7 60 a4 55 81 31 c0 e8 14 80 22 00 b8 ea ff ff ff
Nov 8 19:28:23 arch kernel: RIP [<ffffffff8121f0ff>] ttm_bo_init+0x30f/0x340
Nov 8 19:28:23 arch kernel: RSP <ffff88011b0fbbe8>
Nov 8 19:28:23 arch kernel: ---[ end trace 328a9acba7691d6e ]---
Nov 8 19:28:23 arch kernel: note: X[1541] exited with preempt_count 1
Nov 8 19:28:23 arch kernel: BUG: scheduling while atomic: X/1541/0x10000002
Nov 8 19:28:23 arch kernel: Pid: 1541, comm: X Tainted: G D 2.6.37-rc1-00116-g151f52f-dirty #31
Nov 8 19:28:23 arch kernel: Call Trace:
Nov 8 19:28:23 arch kernel: [<ffffffff81447ad9>] ? schedule+0x639/0x850
Nov 8 19:28:23 arch kernel: [<ffffffff8105826d>] ? __cond_resched+0x1d/0x30
Nov 8 19:28:23 arch kernel: [<ffffffff81447f2f>] ? _cond_resched+0x2f/0x40
Nov 8 19:28:23 arch kernel: [<ffffffff810b57fc>] ? unmap_vmas+0x82c/0x9c0
Nov 8 19:28:23 arch kernel: [<ffffffff810bcb62>] ? exit_mmap+0xe2/0x1a0
Nov 8 19:28:23 arch kernel: [<ffffffff8105a705>] ? mmput+0x25/0xc0
Nov 8 19:28:23 arch kernel: [<ffffffff8105e734>] ? exit_mm+0x104/0x130
Nov 8 19:28:23 arch kernel: [<ffffffff81079ebf>] ? hrtimer_try_to_cancel+0x3f/0x80
Nov 8 19:28:23 arch kernel: [<ffffffff81089d0a>] ? acct_collect+0x9a/0x1a0
Nov 8 19:28:23 arch kernel: [<ffffffff8106045a>] ? do_exit+0x5aa/0x760
Nov 8 19:28:23 arch kernel: [<ffffffff81447163>] ? printk+0x40/0x45
Nov 8 19:28:23 arch kernel: [<ffffffff8105e33c>] ? kmsg_dump+0x7c/0x150
Nov 8 19:28:23 arch kernel: [<ffffffff81031fda>] ? oops_end+0x9a/0xe0
Nov 8 19:28:23 arch kernel: [<ffffffff8102ee74>] ? do_invalid_op+0x84/0xa0
Nov 8 19:28:23 arch kernel: [<ffffffff8121f0ff>] ? ttm_bo_init+0x30f/0x340
Nov 8 19:28:23 arch kernel: [<ffffffff810ddf50>] ? __pollwait+0x0/0x110
Nov 8 19:28:23 arch kernel: [<ffffffff8102e7d5>] ? invalid_op+0x15/0x20
Nov 8 19:28:23 arch kernel: [<ffffffff8121f0ff>] ? ttm_bo_init+0x30f/0x340
Nov 8 19:28:23 arch kernel: [<ffffffff8121efe3>] ? ttm_bo_init+0x1f3/0x340
Nov 8 19:28:23 arch kernel: [<ffffffff8125294d>] ? radeon_bo_create+0x14d/0x250
Nov 8 19:28:23 arch kernel: [<ffffffff812526c0>] ? radeon_ttm_bo_destroy+0x0/0xb0
Nov 8 19:28:23 arch kernel: [<ffffffff812671cc>] ? radeon_gem_object_create+0x8c/0x130
Nov 8 19:28:23 arch kernel: [<ffffffff81267634>] ? radeon_gem_create_ioctl+0x54/0xd0
Nov 8 19:28:23 arch kernel: [<ffffffff813ab26d>] ? sock_aio_read+0x10d/0x120
Nov 8 19:28:23 arch kernel: [<ffffffff8120963c>] ? drm_ioctl+0x39c/0x450
Nov 8 19:28:23 arch kernel: [<ffffffff812675e0>] ? radeon_gem_create_ioctl+0x0/0xd0
Nov 8 19:28:23 arch kernel: [<ffffffff810dd2c9>] ? do_vfs_ioctl+0xa9/0x610
Nov 8 19:28:23 arch kernel: [<ffffffff810dd879>] ? sys_ioctl+0x49/0x80
Nov 8 19:28:23 arch kernel: [<ffffffff810ce24e>] ? sys_read+0x4e/0x90
Nov 8 19:28:23 arch kernel: [<ffffffff8102dc2b>] ? system_call_fastpath+0x16/0x1b
Nov 8 19:28:23 arch kernel: BUG: scheduling while atomic: X/1541/0x10000002
Nov 8 19:28:23 arch kernel: Pid: 1541, comm: X Tainted: G D 2.6.37-rc1-00116-g151f52f-dirty #31
Nov 8 19:28:23 arch kernel: Call Trace:
Nov 8 19:28:23 arch kernel: [<ffffffff81447ad9>] ? schedule+0x639/0x850
Nov 8 19:28:23 arch kernel: [<ffffffff8105826d>] ? __cond_resched+0x1d/0x30
Nov 8 19:28:23 arch kernel: [<ffffffff81447f2f>] ? _cond_resched+0x2f/0x40
Nov 8 19:28:23 arch kernel: [<ffffffff810b57fc>] ? unmap_vmas+0x82c/0x9c0
Nov 8 19:28:23 arch kernel: [<ffffffff810bcb62>] ? exit_mmap+0xe2/0x1a0
Nov 8 19:28:23 arch kernel: [<ffffffff8105a705>] ? mmput+0x25/0xc0
Nov 8 19:28:23 arch kernel: [<ffffffff8105e734>] ? exit_mm+0x104/0x130
Nov 8 19:28:23 arch kernel: [<ffffffff81079ebf>] ? hrtimer_try_to_cancel+0x3f/0x80
Nov 8 19:28:23 arch kernel: [<ffffffff81089d0a>] ? acct_collect+0x9a/0x1a0
Nov 8 19:28:23 arch kernel: [<ffffffff8106045a>] ? do_exit+0x5aa/0x760
Nov 8 19:28:23 arch kernel: [<ffffffff81447163>] ? printk+0x40/0x45
Nov 8 19:28:23 arch kernel: [<ffffffff8105e33c>] ? kmsg_dump+0x7c/0x150
Nov 8 19:28:23 arch kernel: [<ffffffff81031fda>] ? oops_end+0x9a/0xe0
Nov 8 19:28:23 arch kernel: [<ffffffff8102ee74>] ? do_invalid_op+0x84/0xa0
Nov 8 19:28:23 arch kernel: [<ffffffff8121f0ff>] ? ttm_bo_init+0x30f/0x340
Nov 8 19:28:23 arch kernel: [<ffffffff810ddf50>] ? __pollwait+0x0/0x110
Nov 8 19:28:23 arch kernel: [<ffffffff8102e7d5>] ? invalid_op+0x15/0x20
Nov 8 19:28:23 arch kernel: [<ffffffff8121f0ff>] ? ttm_bo_init+0x30f/0x340
Nov 8 19:28:23 arch kernel: [<ffffffff8121efe3>] ? ttm_bo_init+0x1f3/0x340
Nov 8 19:28:23 arch kernel: [<ffffffff8125294d>] ? radeon_bo_create+0x14d/0x250
Nov 8 19:28:23 arch kernel: [<ffffffff812526c0>] ? radeon_ttm_bo_destroy+0x0/0xb0
Nov 8 19:28:23 arch kernel: [<ffffffff812671cc>] ? radeon_gem_object_create+0x8c/0x130
Nov 8 19:28:23 arch kernel: [<ffffffff81267634>] ? radeon_gem_create_ioctl+0x54/0xd0
Nov 8 19:28:23 arch kernel: [<ffffffff813ab26d>] ? sock_aio_read+0x10d/0x120
Nov 8 19:28:23 arch kernel: [<ffffffff8120963c>] ? drm_ioctl+0x39c/0x450
Nov 8 19:28:23 arch kernel: [<ffffffff812675e0>] ? radeon_gem_create_ioctl+0x0/0xd0
Nov 8 19:28:23 arch kernel: [<ffffffff810dd2c9>] ? do_vfs_ioctl+0xa9/0x610
Nov 8 19:28:23 arch kernel: [<ffffffff810dd879>] ? sys_ioctl+0x49/0x80
Nov 8 19:28:23 arch kernel: [<ffffffff810ce24e>] ? sys_read+0x4e/0x90
Nov 8 19:28:23 arch kernel: [<ffffffff8102dc2b>] ? system_call_fastpath+0x16/0x1b
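Each frame in the trace above is printed as symbol+offset/size, so ttm_bo_init+0x30f/0x340 says the faulting instruction sits 0x30f bytes into ttm_bo_init, a function 0x340 bytes long. As a rough aid for reading such traces, here is a small userspace C sketch that splits one frame into its parts. The helper name demo_parse_frame and the 64-byte name buffer are assumptions made for illustration; this is not a kernel or binutils interface.

```c
#include <stdio.h>

/*
 * Hypothetical helper: split "name+0xOFF/0xSIZE" into its three parts.
 * Returns 0 on success, -1 if the text does not have that shape.
 * %lx accepts an optional "0x" prefix, so the oops format parses directly.
 */
static int demo_parse_frame(const char *text, char name[64],
                            unsigned long *off, unsigned long *size)
{
    if (sscanf(text, "%63[^+]+%lx/%lx", name, off, size) != 3)
        return -1;
    return 0;
}
```

Feeding it the RIP frame from the report yields the name "ttm_bo_init", offset 0x30f and size 0x340; resolving that offset to a source line still needs the matching vmlinux with debug info.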
On Mon, Nov 8, 2010 at 2:02 PM, Markus Trippelsdorf markus@trippelsdorf.de wrote:
Nov 8 19:28:23 arch kernel: ------------[ cut here ]------------
Nov 8 19:28:23 arch kernel: kernel BUG at drivers/gpu/drm/ttm/ttm_bo.c:1628!
Quite puzzling: it is as if there were already a bo at the same offset in the rb tree but not in the vm mm. Maybe some other race in destruction...
Cheers, Jerome Glisse
Thomas, this bug seems to point to a case where we end up trying to add an entry at the same offset in the rb tree for addr_space_mm. After carefully reviewing the locking around the rb tree modification and addr_space_mm, I am fairly confident that no race can occur. Would you have any idea what might be going wrong here? I guess I would ultimately need to dump the mm and rb tree state when the BUG gets triggered, to understand the state of things.
Cheers, Jerome
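To make the failure mode Jerome describes concrete, here is a minimal userspace C sketch of an offset-keyed tree insert that detects a second entry at the same offset, plus an in-order dump of the kind he mentions wanting. It is an illustration under stated assumptions, not the TTM code: struct demo_bo, demo_insert and demo_dump are hypothetical names, and a plain unbalanced binary tree stands in for the kernel's red-black tree.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-in for a buffer object; not the TTM structure. */
struct demo_bo {
    unsigned long offset;            /* key: start offset in the address space */
    struct demo_bo *left, *right;
};

/*
 * Insert bo into the tree rooted at *root, keyed by offset.
 * Returns 0 on success, or -EEXIST if a node with the same offset is
 * already present (the situation the reported BUG() seems to point at).
 */
static int demo_insert(struct demo_bo **root, struct demo_bo *bo)
{
    struct demo_bo **cur = root;

    while (*cur) {
        if (bo->offset < (*cur)->offset)
            cur = &(*cur)->left;
        else if (bo->offset > (*cur)->offset)
            cur = &(*cur)->right;
        else
            return -EEXIST;          /* duplicate key: tree left unchanged */
    }
    bo->left = bo->right = NULL;
    *cur = bo;
    return 0;
}

/* In-order walk that prints every offset; returns the number of nodes. */
static int demo_dump(const struct demo_bo *node, FILE *out)
{
    int n;

    if (!node)
        return 0;
    n = demo_dump(node->left, out);
    fprintf(out, "bo at offset 0x%lx\n", node->offset);
    return n + 1 + demo_dump(node->right, out);
}
```

Calling demo_dump at the point where demo_insert reports -EEXIST would show which offsets are already in the tree, which is the kind of state one would compare against the allocator's view.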
On Monday, November 08, 2010, Jerome Glisse wrote:
Thomas, this bug seems to point to a case where we end up trying to add an entry at the same offset in the rb tree for addr_space_mm. After carefully reviewing the locking around the rb tree modification and addr_space_mm, I am fairly confident that no race can occur. Would you have any idea what might be going wrong here? I guess I would ultimately need to dump the mm and rb tree state when the BUG gets triggered, to understand the state of things.
Hmm, why are you using BUG in there in the first place? Would it be _so_ dangerous to continue that we just have to crash here?
Rafael
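The alternative Rafael is asking about, warning and failing the one request instead of crashing, can be sketched in userspace C as follows. In the kernel this style is usually written with WARN_ON(); the DEMO_WARN_ON macro, the toy slot table and demo_claim below are hypothetical analogues for illustration, and the sketch makes no claim about whether continuing would actually be safe in the TTM case.

```c
#include <errno.h>
#include <stdio.h>

/* Counts how many times a check fired; illustration only. */
static int demo_warn_count;

/*
 * Hypothetical userspace analogue of a warn-and-continue check:
 * logs the failed condition, bumps a counter, and evaluates to 1 when
 * cond is true, 0 otherwise, so it can be used inside an if ().
 */
#define DEMO_WARN_ON(cond)                                              \
    ((cond) ? (fprintf(stderr, "WARNING at %s:%d: %s\n",                \
                       __FILE__, __LINE__, #cond),                      \
               ++demo_warn_count, 1)                                    \
            : 0)

#define DEMO_SLOTS 8

/* Toy address space: slot i is occupied when used[i] is nonzero. */
struct demo_space {
    unsigned char used[DEMO_SLOTS];
};

/*
 * Claim a slot. On an inconsistent request (out of range or already
 * occupied), warn and return an error so the caller can fail this one
 * request rather than abort the whole program.
 */
static int demo_claim(struct demo_space *s, unsigned int slot)
{
    if (DEMO_WARN_ON(slot >= DEMO_SLOTS))
        return -EINVAL;
    if (DEMO_WARN_ON(s->used[slot]))
        return -EEXIST;
    s->used[slot] = 1;
    return 0;
}
```

The trade-off is that the caller must then unwind cleanly on the error path; a check that only warns is useful when the surrounding state is still trustworthy after the failed condition.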
On Mon, Nov 8, 2010 at 3:58 PM, Rafael J. Wysocki rjw@sisk.pl wrote:
On Monday, November 08, 2010, Jerome Glisse wrote:
On Mon, Nov 8, 2010 at 2:02 PM, Markus Trippelsdorf markus@trippelsdorf.de wrote:
On Mon, Nov 08, 2010 at 07:43:02PM +0100, Markus Trippelsdorf wrote:
On Mon, Nov 08, 2010 at 06:07:37PM +0100, Markus Trippelsdorf wrote:
On Mon, Nov 08, 2010 at 06:02:21PM +0100, Markus Trippelsdorf wrote:
I can trigger a kernel crash on my system by simply loading this png image with firefox: http://mediaarchive.cern.ch/MediaArchive/Photo/Public/2010/1011251/1011251_0...
Sorry the above link is wrong, this is the right one (that triggers the crash): http://cdsweb.cern.ch/record/1305179/files/HI-150431-630470-huge.png
I triggered it a few more times and took the attached picture. It points to the BUG() call at drivers/gpu/drm/ttm/ttm_bo.c:1628 . (Sorry for the bad picture quality)
And here the same BUG in plaintext (should be a bit easier to read):
Nov 8 19:28:23 arch kernel: ------------[ cut here ]------------
Nov 8 19:28:23 arch kernel: kernel BUG at drivers/gpu/drm/ttm/ttm_bo.c:1628!
Thomas, this bug seems to point to a case where we end up trying to add an entry at the same offset in the rb tree for addr_space_mm. After carefully reviewing the locking around the rb tree modification & addr_space_mm, I am fairly confident that no race can occur. Would you have any idea what might go wrong here? I guess I would ultimately need to dump the mm & rb tree state when the BUG gets triggered to try to understand the state of things.
Hmm, why are you using BUG in there in the first place? Would it be _so_ dangerous to continue that we just have to crash here?
Rafael
This case should _never_ happen. I guess we could return an error and refuse to create the bo, _but_ to me it seems that this case is the result of a corrupted rb or mm structure, so everything might fall apart in more subtle ways if we bail out on this error.
Jerome
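To make the trade-off discussed above concrete (crash with BUG versus returning an error to the caller), here is a toy model in Python. It is only an illustration, not the TTM code: the class `OffsetIndex`, its `insert` method, and the use of a sorted list in place of the kernel rb tree are all assumptions made for the example. The only idea taken from the thread is that a second insertion at an already-occupied offset is the condition that currently trips the BUG.

```python
import bisect
import errno


class OffsetIndex:
    """Toy stand-in for an offset-keyed tree (hypothetical, not TTM)."""

    def __init__(self):
        self._starts = []  # start offsets, kept sorted
        self._objs = {}    # start offset -> object stored there

    def insert(self, start, obj):
        """Insert obj at start; return 0, or -EEXIST if the offset is taken."""
        i = bisect.bisect_left(self._starts, start)
        if i < len(self._starts) and self._starts[i] == start:
            # This is the "should never happen" case. The kernel code
            # calls BUG() here; this sketch reports an error instead.
            return -errno.EEXIST
        self._starts.insert(i, start)
        self._objs[start] = obj
        return 0


index = OffsetIndex()
first = index.insert(0x10, "bo A")   # succeeds
second = index.insert(0x10, "bo B")  # same offset: refused
```

Returning an error keeps the caller alive, but as pointed out above, if the collision comes from a corrupted tree or mm structure, the error path only hides the symptom and later operations may fail in less obvious ways.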
On 11/08/2010 09:58 PM, Rafael J. Wysocki wrote:
On Monday, November 08, 2010, Jerome Glisse wrote:
On Mon, Nov 8, 2010 at 2:02 PM, Markus Trippelsdorf markus@trippelsdorf.de wrote:
Nov 8 19:28:23 arch kernel: ------------[ cut here ]------------
Nov 8 19:28:23 arch kernel: kernel BUG at drivers/gpu/drm/ttm/ttm_bo.c:1628!
Thomas, this bug seems to point to a case where we end up trying to add an entry at the same offset in the rb tree for addr_space_mm. After carefully reviewing the locking around the rb tree modification & addr_space_mm, I am fairly confident that no race can occur. Would you have any idea what might go wrong here? I guess I would ultimately need to dump the mm & rb tree state when the BUG gets triggered to try to understand the state of things.
Hmm, why are you using BUG in there in the first place? Would it be _so_ dangerous to continue that we just have to crash here?
Rafael
BUGs in the TTM module are there to catch incorrect usage of the TTM API, and the intention is that they should only happen during development or stabilizing phases. In this case, we're probably seeing the symptoms of memory corruption or a buggy range manager change.
/Thomas
On 11/08/2010 09:53 PM, Jerome Glisse wrote:
Thomas, this bug seems to point to a case where we end up trying to add an entry at the same offset in the rb tree for addr_space_mm. After carefully reviewing the locking around the rb tree modification & addr_space_mm, I am fairly confident that no race can occur. Would you have any idea what might go wrong here? I guess I would ultimately need to dump the mm & rb tree state when the BUG gets triggered to try to understand the state of things.
Cheers, Jerome
I agree there shouldn't be a race in this case. The locking around these operations is simple and straightforward.
So this IMHO should either be a memory corruption or a bug in the range manager. I've never seen this BUG trigger before. Dumping mm / rb tree contents or bisecting should probably find the culprit.
/Thomas
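Bisecting, as suggested above, amounts to a binary search over the commit history between a known-good and a known-bad kernel. The sketch below shows the idea in Python under simplifying assumptions: the history is treated as a linear list, the commit names are placeholders, and `is_bad` stands in for "build, boot, load the image, see whether X crashes". In practice `git bisect` automates this and also copes with non-linear history.

```python
def first_bad_commit(commits, is_bad):
    """Return the first bad commit in a linear history.

    commits is ordered oldest to newest; commits[0] must be known good
    and commits[-1] known bad. is_bad(commit) runs the reproduction test.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]


# Placeholder history: c0 is the last known-good kernel, c9 the crashing one.
history = ["c%d" % i for i in range(10)]
tested = []


def crashes(commit):
    """Hypothetical test: pretend the regression was introduced in c6."""
    tested.append(commit)
    return int(commit[1:]) >= 6


culprit = first_bad_commit(history, crashes)
```

Each step halves the remaining range, so even a few thousand commits between two releases need only about a dozen build-and-test cycles.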
On Mon, Nov 08, 2010 at 11:29:16PM +0100, Thomas Hellstrom wrote:
I agree there shouldn't be a race in this case. The locking around these operations is simple and straightforward.
So this IMHO should either be a memory corruption or a bug in the range manager. I've never seen this BUG trigger before. Dumping mm / rb tree contents or bisecting should probably find the culprit.
OK I've found the buggy commit by bisection:
e376573f7267390f4e1bdc552564b6fb913bce76 is the first bad commit
commit e376573f7267390f4e1bdc552564b6fb913bce76
Author: Michel Dänzer <daenzer@vmware.com>
Date: Thu Jul 8 12:43:28 2010 +1000

    drm/radeon: fall back to GTT if bo creation/validation in VRAM fails.

    This fixes a problem where on low VRAM cards we'd run out of space for validation.

    [airlied: Tested on my M7, Thinkpad T42, compiz works with no problems.]

    Signed-off-by: Michel Dänzer <daenzer@vmware.com>
    Cc: stable@kernel.org
    Signed-off-by: Dave Airlie <airlied@redhat.com>
Please note that this is an old commit from 2.6.36-rc. When I revert it the kernel no longer crashes. Instead I see the following in my dmesg:
[TTM] Failed to find memory space for buffer 0xffff880113e10e48 eviction.
[TTM] No space for ffff880113e10e48 (25650 pages, 102600K, 100M)
[TTM] placement[0]=0x00070002 (1)
[TTM] has_type: 1
[TTM] use_type: 1
[TTM] flags: 0x0000000A
[TTM] gpu_offset: 0xA0000000
[TTM] size: 131072
[TTM] available_caching: 0x00070000
[TTM] default_caching: 0x00010000
[TTM] 0x00000000-0x00000001: 1: used
[TTM] 0x00000001-0x00000011: 16: used
[TTM] 0x00000011-0x00000111: 256: used
[TTM] 0x00000111-0x00000211: 256: used
[TTM] 0x00000211-0x00000248: 55: free
[TTM] 0x00000248-0x0000024c: 4: used
[TTM] 0x0000024c-0x00001976: 5930: free
[TTM] 0x00001976-0x000021aa: 2100: used
[TTM] 0x000021aa-0x0000285f: 1717: free
[TTM] 0x0000285f-0x00002860: 1: used
[TTM] 0x00002860-0x00002873: 19: free
[TTM] 0x00002873-0x000029b3: 320: used
[TTM] 0x000029b3-0x00020000: 120397: free
[TTM] total: 131072, used 2954 free 128118
[drm:radeon_cs_ioctl] *ERROR* Failed to parse relocation -12!
radeon 0000:01:05.0: object_init failed for (117555200, 0x00000004)
[drm:radeon_gem_object_create] *ERROR* Failed to allocate GEM object (117555200, 4, 4096, -12)
radeon 0000:01:05.0: object_init failed for (117555200, 0x00000004)
[drm:radeon_gem_object_create] *ERROR* Failed to allocate GEM object (117555200, 4, 4096, -12)
radeon 0000:01:05.0: object_init failed for (117555200, 0x00000004)
[drm:radeon_gem_object_create] *ERROR* Failed to allocate GEM object (117555200, 4, 4096, -12)
radeon 0000:01:05.0: object_init failed for (117555200, 0x00000004)
[drm:radeon_gem_object_create] *ERROR* Failed to allocate GEM object (117555200, 4, 4096, -12)
radeon 0000:01:05.0: object_init failed for (117555200, 0x00000004)
...
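As a sanity check on the range dump in the dmesg output above, the per-range lines can be summed and compared with the reported totals. The following Python sketch does that. It is not part of the original report: the regular expressions and the helper name `summarize` are assumptions based only on how the lines look in this excerpt, and the 4 KiB page size is an assumption that matches the "25650 pages, 102600K" line.

```python
import re

# Range lines copied from the dmesg excerpt (offsets are in pages).
DUMP = """\
[TTM] 0x00000000-0x00000001: 1: used
[TTM] 0x00000001-0x00000011: 16: used
[TTM] 0x00000011-0x00000111: 256: used
[TTM] 0x00000111-0x00000211: 256: used
[TTM] 0x00000211-0x00000248: 55: free
[TTM] 0x00000248-0x0000024c: 4: used
[TTM] 0x0000024c-0x00001976: 5930: free
[TTM] 0x00001976-0x000021aa: 2100: used
[TTM] 0x000021aa-0x0000285f: 1717: free
[TTM] 0x0000285f-0x00002860: 1: used
[TTM] 0x00002860-0x00002873: 19: free
[TTM] 0x00002873-0x000029b3: 320: used
[TTM] 0x000029b3-0x00020000: 120397: free
[TTM] total: 131072, used 2954 free 128118
"""

RANGE_RE = re.compile(
    r"\[TTM\] 0x([0-9a-fA-F]+)-0x([0-9a-fA-F]+): (\d+): (used|free)")
TOTAL_RE = re.compile(r"\[TTM\] total: (\d+), used (\d+) free (\d+)")
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def summarize(dump):
    """Sum used/free pages and check the ranges are contiguous."""
    used = free = largest_free = prev_end = 0
    for start_hex, end_hex, size_str, state in RANGE_RE.findall(dump):
        start, end, size = int(start_hex, 16), int(end_hex, 16), int(size_str)
        if end - start != size:
            raise ValueError("size field disagrees with range bounds")
        if start != prev_end:
            raise ValueError("gap or overlap between consecutive ranges")
        prev_end = end
        if state == "used":
            used += size
        else:
            free += size
            largest_free = max(largest_free, size)
    return {"total": prev_end, "used": used, "free": free,
            "largest_free": largest_free}


summary = summarize(DUMP)
reported = tuple(int(x) for x in TOTAL_RE.search(DUMP).groups())
request_pages = 117555200 // PAGE_SIZE  # size from the object_init lines
```

For this dump the per-range sums come to 2954 used and 128118 free pages out of 131072, matching the reported totals, and the largest free hole is 120397 pages. For comparison, the failing 117555200-byte request is 28700 pages at 4 KiB per page.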
And the following in the xorg log buffer:
Failed to alloc memory
Failed to allocat:
size: : 117555200 bytes
alignment : 0 bytes
domains : 4
...
On 11/09/2010 10:29 AM, Markus Trippelsdorf wrote:
On Mon, Nov 08, 2010 at 11:29:16PM +0100, Thomas Hellstrom wrote:
On 11/08/2010 09:53 PM, Jerome Glisse wrote:
On Mon, Nov 8, 2010 at 2:02 PM, Markus Trippelsdorf markus@trippelsdorf.de wrote:
On Mon, Nov 08, 2010 at 07:43:02PM +0100, Markus Trippelsdorf wrote:
On Mon, Nov 08, 2010 at 06:07:37PM +0100, Markus Trippelsdorf wrote:
On Mon, Nov 08, 2010 at 06:02:21PM +0100, Markus Trippelsdorf wrote:
I can trigger a kernel crash on my system by simply loading this png image with firefox: http://mediaarchive.cern.ch/MediaArchive/Photo/Public/2010/1011251/1011251_0...
Sorry the above link is wrong, this is the right one (that triggers the crash): http://cdsweb.cern.ch/record/1305179/files/HI-150431-630470-huge.png
I triggered it a few more times and took the attached picture. It points to the BUG() call at drivers/gpu/drm/ttm/ttm_bo.c:1628 . (Sorry for the bad picture quality)
And here the same BUG in plaintext (should be a bit easier to read):
Nov 8 19:28:23 arch kernel: ------------[ cut here ]------------
Nov 8 19:28:23 arch kernel: kernel BUG at drivers/gpu/drm/ttm/ttm_bo.c:1628!
Thomas, this bug seems to point to a case where we end up trying to add an entry at the same offset in the rb tree for addr_space_mm. After carefully reviewing the locking around the rb tree modification & addr_space_mm, I am fairly confident that no race can occur. Would you have any idea what might go wrong here? I guess I would ultimately need to dump the mm & rb tree state when the BUG gets triggered to try to understand the state of things.
I agree there shouldn't be a race in this case. The locking around these operations is simple and straightforward.
So this IMHO should either be a memory corruption or a bug in the range manager. I've never seen this BUG trigger before. Dumping mm / rb tree contents or bisecting should probably find the culprit.
OK I've found the buggy commit by bisection:
e376573f7267390f4e1bdc552564b6fb913bce76 is the first bad commit
commit e376573f7267390f4e1bdc552564b6fb913bce76
Author: Michel Dänzer <daenzer@vmware.com>
Date: Thu Jul 8 12:43:28 2010 +1000

drm/radeon: fall back to GTT if bo creation/validation in VRAM fails.

This fixes a problem where on low VRAM cards we'd run out of space for validation.

[airlied: Tested on my M7, Thinkpad T42, compiz works with no problems.]

Signed-off-by: Michel Dänzer <daenzer@vmware.com>
Cc: stable@kernel.org
Signed-off-by: Dave Airlie <airlied@redhat.com>
Please note that this is an old commit from 2.6.36-rc. When I revert it the kernel no longer crashes. Instead I see the following in my dmesg:
Hmm, so this sounds like something in the Radeon eviction error path is causing corruption. I had a similar problem with vmwgfx, when I tried to unref a BO _after_ ttm_bo_init() failed. ttm_bo_init() is really supposed to call unref itself for various reasons, so calling unref() or kfree() after a failed ttm_bo_init() will cause corruption.
In any case, the error below also suggests something is a bit fragile in the Radeon driver:
First, an accelerated eviction may fail, like in the message below, but then there must always be a backup plan, like unaccelerated eviction to system. On BO creation, there are a number of placement strategies, but if all else fails, it should be possible to initially place the BO in system memory.
Second, if BO validation fails during a command submission due to insufficient VRAM / TT, then the driver should retry the complete validation cycle after first blocking all other validators and then evicting everything not pinned, to avoid failures due to fragmentation.
/Thomas
[TTM] Failed to find memory space for buffer 0xffff880113e10e48 eviction.
[TTM] No space for ffff880113e10e48 (25650 pages, 102600K, 100M)
[TTM] placement[0]=0x00070002 (1)
[TTM] has_type: 1
[TTM] use_type: 1
[TTM] flags: 0x0000000A
[TTM] gpu_offset: 0xA0000000
[TTM] size: 131072
[TTM] available_caching: 0x00070000
[TTM] default_caching: 0x00010000
[TTM] 0x00000000-0x00000001: 1: used
[TTM] 0x00000001-0x00000011: 16: used
[TTM] 0x00000011-0x00000111: 256: used
[TTM] 0x00000111-0x00000211: 256: used
[TTM] 0x00000211-0x00000248: 55: free
[TTM] 0x00000248-0x0000024c: 4: used
[TTM] 0x0000024c-0x00001976: 5930: free
[TTM] 0x00001976-0x000021aa: 2100: used
[TTM] 0x000021aa-0x0000285f: 1717: free
[TTM] 0x0000285f-0x00002860: 1: used
[TTM] 0x00002860-0x00002873: 19: free
[TTM] 0x00002873-0x000029b3: 320: used
[TTM] 0x000029b3-0x00020000: 120397: free
[TTM] total: 131072, used 2954 free 128118
[drm:radeon_cs_ioctl] *ERROR* Failed to parse relocation -12!
radeon 0000:01:05.0: object_init failed for (117555200, 0x00000004)
[drm:radeon_gem_object_create] *ERROR* Failed to allocate GEM object (117555200, 4, 4096, -12)
radeon 0000:01:05.0: object_init failed for (117555200, 0x00000004)
[drm:radeon_gem_object_create] *ERROR* Failed to allocate GEM object (117555200, 4, 4096, -12)
radeon 0000:01:05.0: object_init failed for (117555200, 0x00000004)
[drm:radeon_gem_object_create] *ERROR* Failed to allocate GEM object (117555200, 4, 4096, -12)
radeon 0000:01:05.0: object_init failed for (117555200, 0x00000004)
[drm:radeon_gem_object_create] *ERROR* Failed to allocate GEM object (117555200, 4, 4096, -12)
radeon 0000:01:05.0: object_init failed for (117555200, 0x00000004)
...
And the following in the xorg log buffer:
Failed to alloc memory
Failed to allocat:
size: : 117555200 bytes
alignment : 0 bytes
domains : 4
...
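Thomas's first point, that BO creation should always have a last-resort placement in system memory, can be illustrated with a small userspace model. This is only a sketch under stated assumptions: the domain names, the pools structure, and both functions are hypothetical and are not part of TTM or the radeon driver. It shows nothing more than the order in which placements are tried.

```c
#include <stddef.h>

/* Hypothetical memory domains, ordered from most to least preferred.
 * These names are illustrative only and are not TTM identifiers. */
enum domain { DOMAIN_VRAM, DOMAIN_GTT, DOMAIN_SYSTEM, DOMAIN_COUNT };

/* Toy bookkeeping: free pages per domain (not real TTM state). */
struct pools {
	unsigned long free_pages[DOMAIN_COUNT];
};

/* Try to place an allocation in one domain.
 * Returns 0 on success, -1 if the domain cannot hold it. */
int place_in(struct pools *p, enum domain d, unsigned long pages)
{
	if (p->free_pages[d] < pages)
		return -1;
	p->free_pages[d] -= pages;
	return 0;
}

/* Walk the preference list, with system memory as the last resort.
 * Returns the domain that was used, or -1 if nothing fits. */
int place_with_fallback(struct pools *p, unsigned long pages)
{
	for (int d = DOMAIN_VRAM; d < DOMAIN_COUNT; d++) {
		if (place_in(p, (enum domain)d, pages) == 0)
			return d;
	}
	return -1;
}
```

In this model a request only fails outright when even the system pool is exhausted, which is the property Thomas asks for. A real driver would also have to deal with eviction and fragmentation, which the sketch ignores.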
On 11/09/2010 10:53 AM, Thomas Hellstrom wrote:
Hmm, so this sounds like something in the Radeon eviction error path is causing corruption. I had a similar problem with vmwgfx, when I tried to unref a BO _after_ ttm_bo_init() failed. ttm_bo_init() is really supposed to call unref itself for various reasons, so calling unref() or kfree() after a failed ttm_bo_init() will cause corruption.
In any case, the error below also suggests something is a bit fragile in the Radeon driver:
First, an accelerated eviction may fail, like in the message below, but then there must always be a backup plan, like unaccelerated eviction to system. On BO creation, there are a number of placement strategies, but if all else fails, it should be possible to initially place the BO in system memory.
Second, if BO validation fails during a command submission due to insufficient VRAM / TT, then the driver should retry the complete validation cycle after first blocking all other validators and then evicting everything not pinned, to avoid failures due to fragmentation.
/Thomas
Indeed, it seems like the commit you mention just retries ttm_bo_init() after it previously failed. At that point the bo has been destroyed, so that is probably what's causing the BUG you are seeing.
Admittedly, the fact that ttm_bo_init() calls unref on failure is not properly documented in the function description. The reason for doing so is to have a single path for freeing all BO resources already allocated at the point of failure.
/Thomas
dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel
On Tue, 2010-11-09 at 11:07 +0100, Thomas Hellstrom wrote:
Indeed, it seems like the commit you mention just retries ttm_bo_init() after it previously failed. At that point the bo has been destroyed, so that is probably what's causing the BUG you are seeing.
Admittedly, the fact that ttm_bo_init() calls unref on failure is not properly documented in the function description. The reason for doing so is to have a single path for freeing all BO resources already allocated at the point of failure.
Does the patch below fix the problem?
commit e224472eedbda391ddb6d8b88f26e82e1c3b036b
Author: Michel Dänzer <daenzer@vmware.com>
Date: Tue Nov 9 11:30:41 2010 +0100

drm/radeon/kms: Fix retrying ttm_bo_init() after it failed once.

If ttm_bo_init() returns failure, it already destroyed the BO, so we need to retry from scratch.

Signed-off-by: Michel Dänzer <daenzer@vmware.com>
Cc: stable@kernel.org

diff --git a/drivers/gpu/drm/radeon/radeon_object.c b/drivers/gpu/drm/radeon/radeon_object.c
index 1b9004e..bbe92d5 100644
--- a/drivers/gpu/drm/radeon/radeon_object.c
+++ b/drivers/gpu/drm/radeon/radeon_object.c
@@ -102,6 +102,8 @@ int radeon_bo_create(struct radeon_device *rdev, struct drm_gem_object *gobj,
 		type = ttm_bo_type_device;
 	}
 	*bo_ptr = NULL;
+
+retry:
 	bo = kzalloc(sizeof(struct radeon_bo), GFP_KERNEL);
 	if (bo == NULL)
 		return -ENOMEM;
@@ -109,8 +111,6 @@ int radeon_bo_create(struct radeon_device *rdev, struct drm_gem_object *gobj,
 	bo->gobj = gobj;
 	bo->surface_reg = -1;
 	INIT_LIST_HEAD(&bo->list);
-
-retry:
 	radeon_ttm_placement_from_domain(bo, domain);
 	/* Kernel allocation are uninterruptible */
 	mutex_lock(&rdev->vram_mutex);
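The ownership rule behind this patch can be shown with a small userspace model. It is a minimal sketch, not the kernel code: toy_bo, toy_bo_init(), toy_bo_create(), the TOY_* domains, and the live_objects counter are all hypothetical names, and calloc()/free() stand in for kzalloc() and the BO destroy path. The only thing it models is the contract Thomas describes, that the init function destroys the object when it fails, so a retry has to allocate a fresh object first.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical placement domains for this toy model. */
enum { TOY_VRAM, TOY_GTT };

/* Hypothetical stand-in for a buffer object. */
struct toy_bo {
	int domain;
};

/* Number of toy objects currently allocated and not yet freed. */
int live_objects;

/* Models the contract discussed in the thread: on failure the init
 * function destroys the object itself, so the caller must neither use
 * nor free it afterwards. */
int toy_bo_init(struct toy_bo *bo, int domain, int vram_full)
{
	bo->domain = domain;
	if (domain == TOY_VRAM && vram_full) {
		free(bo);          /* init owns the cleanup on failure */
		live_objects--;
		return -ENOMEM;
	}
	return 0;
}

/* Retry "from scratch": the label sits before the allocation, so every
 * attempt hands a freshly allocated object to toy_bo_init(). */
struct toy_bo *toy_bo_create(int vram_full)
{
	int domain = TOY_VRAM;
	struct toy_bo *bo;
	int r;

retry:
	bo = calloc(1, sizeof(*bo));
	if (bo == NULL)
		return NULL;
	live_objects++;

	r = toy_bo_init(bo, domain, vram_full);
	if (r != 0) {
		/* bo is already gone here; only the retry decision is left. */
		if (domain == TOY_VRAM) {
			domain = TOY_GTT;  /* fall back to the second domain */
			goto retry;
		}
		return NULL;
	}
	return bo;
}
```

If the retry label were placed after the allocation instead, the second attempt would pass an already freed pointer to toy_bo_init(), which is the use-after-free pattern the thread identifies as the likely cause of the BUG.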
On Tue, Nov 09, 2010 at 11:32:57AM +0100, Michel Dänzer wrote:
Does the patch below fix the problem?
Yes, indeed. I was just about to send the same patch to the list.
Thanks.
On Tue, 2010-11-09 at 11:37 +0100, Markus Trippelsdorf wrote:
Yes, indeed. I was just about to send the same patch to the list.
Thanks.
Thank you for testing / confirming the fix, and to Thomas for the analysis of the problem.
I've submitted the fix to Dave with your Tested-by: added.
I've hit a new BUG using the same trigger (huge png image in firefox, zoom in and zoom out while the picture is loading):
Nov 27 09:56:14 [kernel] kernel BUG at drivers/gpu/drm/ttm/ttm_bo.c:1134!
Nov 27 09:56:14 [kernel] CPU 3
Nov 27 09:56:14 [kernel] Pid: 1867, comm: X Not tainted 2.6.37-rc3-00215-g94ece42-dirty #8 M4A78T-E/System Product Name
Nov 27 09:56:14 [kernel] RIP: 0010:[<ffffffff8128b181>] [<ffffffff8128b181>] ttm_bo_check_placement+0x21/0x30
Nov 27 09:56:14 [kernel] RSP: 0018:ffff880118f99b98 EFLAGS: 00010206
Nov 27 09:56:14 [kernel] RAX: 0000000000008000 RBX: ffff88011c610020 RCX: ffffffff812c2d80
Nov 27 09:56:14 [kernel] RDX: 0000000008688000 RSI: ffff88011c610020 RDI: ffff88011c610048
Nov 27 09:56:14 [kernel] RBP: ffff880118f99b98 R08: ffff88011c610020 R09: 0000000000000001
Nov 27 09:56:14 [kernel] R10: 0000000000000001 R11: 0000000000000001 R12: ffff88011ef30508
Nov 27 09:56:14 [kernel] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000008688
Nov 27 09:56:14 [kernel] FS: 00007ff13cc3e700(0000) GS:ffff8800dfd80000(0000) knlGS:0000000000000000
Nov 27 09:56:14 [kernel] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 27 09:56:14 [kernel] CR2: 00007ff13ca7b000 CR3: 000000011c455000 CR4: 00000000000006e0
Nov 27 09:56:14 [kernel] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Nov 27 09:56:14 [kernel] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Nov 27 09:56:14 [kernel] Process X (pid: 1867, threadinfo ffff880118f98000, task ffff88011fe24e00)
Nov 27 09:56:14 [kernel] ffff880118f99be8 ffffffff8128d22d ffff880100000001 ffff88011c610048
Nov 27 09:56:14 [kernel] 0000000008688000 ffff88011c610000 ffff88011ef30000 ffff88011c610000
Nov 27 09:56:14 [kernel] 0000000000000000 0000000000000001 ffff880118f99cb8 ffffffff812c3041
Nov 27 09:56:14 [kernel] [<ffffffff8128d22d>] ttm_bo_init+0x17d/0x360
Nov 27 09:56:14 [kernel] [<ffffffff812c3041>] radeon_bo_create+0x151/0x2c0
Nov 27 09:56:14 [kernel] [<ffffffff812c2d80>] ? radeon_ttm_bo_destroy+0x0/0xc0
Nov 27 09:56:14 [kernel] [<ffffffff81120c80>] ? pollwake+0x0/0x60
Nov 27 09:56:14 [kernel] [<ffffffff812d8f1b>] radeon_gem_object_create+0x8b/0x110
Nov 27 09:56:14 [kernel] [<ffffffff8106b5db>] ? __wake_up+0x4b/0x60
Nov 27 09:56:14 [kernel] [<ffffffff812d9358>] radeon_gem_create_ioctl+0x58/0xd0
Nov 27 09:56:14 [kernel] [<ffffffff812d96ed>] ? radeon_gem_wait_idle_ioctl+0xed/0x110
Nov 27 09:56:14 [kernel] [<ffffffff81274efc>] drm_ioctl+0x3bc/0x490
Nov 27 09:56:14 [kernel] [<ffffffff812d9300>] ? radeon_gem_create_ioctl+0x0/0xd0
Nov 27 09:56:14 [kernel] [<ffffffff8111ffaa>] do_vfs_ioctl+0x9a/0x540
Nov 27 09:56:14 [kernel] [<ffffffff81039e9b>] system_call_fastpath+0x16/0x1b
Nov 27 09:56:14 [kernel] RIP [<ffffffff8128b181>] ttm_bo_check_placement+0x21/0x30
Nov 27 09:56:14 [kernel] RSP <ffff880118f99b98>
Nov 27 09:56:14 [kernel] ---[ end trace d14c91f5f64be308 ]---
Nov 27 09:57:40 [kernel] Emergency Sync complete