After around 10-30 mins of typical usage (ie cairo rendering to surfaces + compiz) on my Dell Studio 15 (model 1557), my R600 radeon locks up, giving a blank screen.
The only information I have is a backtrace [1]. 'radeontool regs' doesn't give anything - what information, /sys files etc would be useful here, and would it be useful to log a freedesktop.org bug report at this stage, assuming I haven't found a reliable reproducer?
OS is Ubuntu 10.10 development release, mesa 7.8.2, libdrm2 2.4.21.
Thanks, Daniel
--- [1]
radeon 0000:02:00.0: GPU lockup CP stall for more than 1000msec ------------[ cut here ]------------ WARNING: at /home/kernel-ppa/COD/linux/drivers/gpu/drm/radeon/radeon_fence.c:235 radeon_fence_wait+0x2b7/0x320 [radeon]() Hardware name: Studio 1557 GPU lockup (waiting for 0x00008C35 last fence id 0x00008C33) Modules linked in: binfmt_misc parport_pc ppdev kvm_intel kvm microcode snd_hda_codec_atihdmi joydev ipt_REJECT xt_limit xt_tcpudp ipt_addrtype xt_state snd_hda_codec_idt ip6table_filter ip6_tables snd_hda_intel nf_nat_irc nf_conntrack_irc snd_hda_codec nf_nat_ftp nf_nat snd_hwdep nf_conntrack_ipv4 snd_seq_midi arc4 snd_pcm snd_rawmidi nf_defrag_ipv4 radeon nf_conntrack_ftp nf_conntrack iptable_filter snd_seq_midi_event snd_seq ip_tables ttm iwlagn video snd_timer uvcvideo x_tables snd_seq_device output drm_kms_helper lp dell_laptop iwlcore drm videodev dcdbas dell_wmi v4l1_compat mac80211 i7core_edac parport v4l2_compat_ioctl32 snd psmouse edac_core i2c_algo_bit soundcore cfg80211 snd_page_alloc serio_raw sdhci_pci sdhci ahci libahci led_class r8169 mii btrfs zlib_deflate crc32c libcrc32c Pid: 1624, comm: Xorg Tainted: G D 2.6.35-020635-generic #020635 Call Trace: ? radeon_fence_wait+0x2b7/0x320 [radeon] warn_slowpath_common+0x90/0xc0 warn_slowpath_fmt+0x6e/0x70 ? schedule_timeout+0x15a/0x2e0 ? r600_irq_set+0x27d/0xc00 [radeon] ? radeon_ring_commit+0xa3/0xb0 [radeon] ? r100_gpu_cp_is_lockup+0xc2/0xd0 [radeon] ? r600_gpu_is_lockup+0x1cb/0x220 [radeon] radeon_fence_wait+0x2b7/0x320 [radeon] ? autoremove_wake_function+0x0/0x40 radeon_sync_obj_wait+0x11/0x20 [radeon] ttm_bo_wait+0x102/0x1b0 [ttm] ttm_bo_move_accel_cleanup+0x19e/0x230 [ttm] radeon_move_blit+0x124/0x170 [radeon] radeon_bo_move+0xda/0x1a0 [radeon] ttm_bo_handle_move_mem+0xec/0x370 [ttm] ttm_bo_evict+0x1cc/0x270 [ttm] ? drm_mm_split_at_start+0x1d/0x80 [drm] ttm_mem_evict_first+0xed/0x180 [ttm] ? ttm_bo_man_get_node+0xd5/0xe0 [ttm] ttm_bo_mem_force_space+0xab/0x110 [ttm] ttm_bo_mem_space+0x305/0x370 [ttm] ttm_bo_move_buffer+0xcf/0x140 [ttm] ? drm_mm_split_at_start+0x1d/0x80 [drm] ttm_bo_validate+0xd5/0x100 [ttm] ttm_bo_init+0x1f2/0x240 [ttm] radeon_bo_create+0x121/0x240 [radeon] ? radeon_ttm_bo_destroy+0x0/0x80 [radeon] radeon_gem_object_create+0x89/0x100 [radeon] ? drm_gem_handle_create+0x8b/0xa0 [drm] radeon_gem_create_ioctl+0x58/0xe0 [radeon] drm_ioctl+0x283/0x460 [drm] ? radeon_gem_create_ioctl+0x0/0xe0 [radeon] ? do_readv_writev+0x16b/0x1e0 vfs_ioctl+0x3a/0xc0 do_vfs_ioctl+0x6d/0x1f0 sys_ioctl+0x87/0xa0 ? sys_writev+0x55/0xb0 system_call_fastpath+0x16/0x1b ---[ end trace c0bc12025fa8386c ]--- radeon 0000:02:00.0: GPU softreset radeon 0000:02:00.0: R_008010_GRBM_STATUS=0xA0003028 radeon 0000:02:00.0: R_008014_GRBM_STATUS2=0x00000002 radeon 0000:02:00.0: R_000E50_SRBM_STATUS=0x200000C0 radeon 0000:02:00.0: R_008020_GRBM_SOFT_RESET=0x00007FEE radeon 0000:02:00.0: R_008020_GRBM_SOFT_RESET=0x00000001 radeon 0000:02:00.0: R_008010_GRBM_STATUS=0x00003028 radeon 0000:02:00.0: R_008014_GRBM_STATUS2=0x00000002 radeon 0000:02:00.0: R_000E50_SRBM_STATUS=0x200000C0 radeon 0000:02:00.0: GPU reset succeed Clocks initialized ! ring test succeeded in 1 usecs ib test succeeded in 1 usecs
On 08/05/2010 06:52 AM, Daniel J Blueman wrote:
SO it keep reseting ? According to log there is a GPU lockup but then a successfull GPU reset so GPU should resume fine (that's what log says) Best is to open a bug and attach full dmesg + lspci -vv and context in which the lockup happen
Cheers, Jerome
On Thu, Aug 5, 2010 at 9:50 AM, Jerome Glisse glisse@freedesktop.org wrote:
Perhaps the lockup timeout is too short or interrupts are being delivered late or something? There's a fdo or kernel bugzilla entry where removing the lockup timeout fixed the issue. Does something like this:
--- a/drivers/gpu/drm/radeon/radeon_fence.c +++ b/drivers/gpu/drm/radeon/radeon_fence.c @@ -237,9 +237,11 @@ retry: * as signaled for now */ rdev->gpu_lockup = true; +#if 0 r = radeon_gpu_reset(rdev); if (r) return r; +#endif WREG32(rdev->fence_drv.scratch_reg, fence->seq); rdev->gpu_lockup = false; }
fix the issue?
Alex
On 5 August 2010 16:23, Alex Deucher alexdeucher@gmail.com wrote:
Correction, RV710.
Well, RADEON_FENCE_JIFFIES_TIMEOUT plus 1000ms is already excruciatingly long for the command processor to not have responded, but the issue I'm really hitting here is that X if left hanging after GPU reset. The GPU lockup was being reported at:
[<ffffffffa03b66d7>] ? radeon_fence_wait+0x2b7/0x320 [radeon] [<ffffffff81061370>] warn_slowpath_common+0x90/0xc0 [<ffffffff8106149e>] warn_slowpath_fmt+0x6e/0x70 [<ffffffff81575f8a>] ? schedule_timeout+0x15a/0x2e0 [<ffffffffa03f5a9d>] ? r600_irq_set+0x27d/0xc00 [radeon] [<ffffffffa03ce1f3>] ? radeon_ring_commit+0xa3/0xb0 [radeon] [<ffffffffa03d5412>] ? r100_gpu_cp_is_lockup+0xc2/0xd0 [radeon] [<ffffffffa040063b>] ? r600_gpu_is_lockup+0x1cb/0x220 [radeon] [<ffffffffa03b66d7>] radeon_fence_wait+0x2b7/0x320 [radeon] [<ffffffff8107fcd0>] ? autoremove_wake_function+0x0/0x40 [<ffffffffa03b74a1>] radeon_sync_obj_wait+0x11/0x20 [radeon] [<ffffffffa02ff332>] ttm_bo_wait+0x102/0x1b0 [ttm] [<ffffffffa0302ede>] ttm_bo_move_accel_cleanup+0x19e/0x230 [ttm] [<ffffffffa03b75f4>] radeon_move_blit+0x124/0x170 [radeon]
X consistently hangs after GPU reset at (with sysrq-T):
[<ffffffff81576838>] __mutex_lock_slowpath+0xe8/0x170 [<ffffffff8157647b>] mutex_lock+0x2b/0x50 [<ffffffffa0415aba>] radeon_pm_set_clocks+0x3a/0x2d0 [radeon] [<ffffffff812abc51>] ? idr_get_new_above+0x11/0x40 [<ffffffffa01679ab>] ? drm_mode_object_get+0x6b/0xc0 [drm] [<ffffffff811410ae>] ? kmem_cache_alloc_notrace+0x8e/0xd0 [<ffffffffa041662d>] radeon_pm_compute_clocks+0xcd/0x240 [radeon] [<ffffffffa03a9d21>] atombios_crtc_mode_fixup+0x31/0x60 [radeon] [<ffffffffa02ba13a>] drm_crtc_helper_set_mode+0x16a/0x370 [drm_kms_helper] [<ffffffffa02ba3bd>] drm_helper_resume_force_mode+0x7d/0x150 [drm_kms_helper] [<ffffffffa03a2f35>] radeon_gpu_reset+0x85/0xd0 [radeon] [<ffffffffa03b66e7>] radeon_fence_wait+0x2c7/0x320 [radeon] [<ffffffff8107fcd0>] ? autoremove_wake_function+0x0/0x40 [<ffffffffa03b74a1>] radeon_sync_obj_wait+0x11/0x20 [radeon] [<ffffffffa02ff332>] ttm_bo_wait+0x102/0x1b0 [ttm] [<ffffffffa0302ede>] ttm_bo_move_accel_cleanup+0x19e/0x230 [ttm] [<ffffffffa03b75f4>] radeon_move_blit+0x124/0x170 [radeon]
I couldn't reproduce it (but did hit it a number of times over the last two weeks), so have been analysing with what we have. The kernel the backtrace is against isn't built with debug, so:
radeon_pm_set_clocks(): 8ca80: 55 push %rbp 8ca81: 48 89 e5 mov %rsp,%rbp 8ca84: 41 57 push %r15 8ca86: 41 56 push %r14 8ca88: 41 55 push %r13 8ca8a: 41 54 push %r12 8ca8c: 53 push %rbx 8ca8d: 48 83 ec 48 sub $0x48,%rsp 8ca91: e8 00 00 00 00 callq 8ca96 <radeon_pm_set_clocks+0x16> 8ca96: 49 89 fc mov %rdi,%r12 8ca99: 48 8b 7f 08 mov 0x8(%rdi),%rdi 8ca9d: 48 83 c7 20 add $0x20,%rdi 8caa1: e8 00 00 00 00 callq 8caa6 <radeon_pm_set_clocks+0x26> 8caa6: 49 8d 84 24 a0 1c 00 lea 0x1ca0(%r12),%rax 8caad: 00 8caae: 48 89 c7 mov %rax,%rdi 8cab1: 48 89 45 90 mov %rax,-0x70(%rbp) 8cab5: e8 00 00 00 00 callq 8caba <radeon_pm_set_clocks+0x3a> -> 8caba: 49 8d 8c 24 e8 09 00 lea 0x9e8(%r12),%rcx 8cac1: 00 8cac2: 48 89 cf mov %rcx,%rdi 8cac5: 48 89 4d 98 mov %rcx,-0x68(%rbp) 8cac9: e8 00 00 00 00 callq 8cace <radeon_pm_set_clocks+0x4e> 8cace: 41 83 7c 24 78 1a cmpl $0x1a,0x78(%r12) 8cad4: 0f 86 46 01 00 00 jbe 8cc20 <radeon_pm_set_clocks+0x1a0> 8cada: 41 80 bc 24 d0 0d 00 cmpb $0x0,0xdd0(%r12) 8cae1: 00 00
This corresponds to:
static void radeon_pm_set_clocks(struct radeon_device *rdev) { int i;
mutex_lock(&rdev->ddev->struct_mutex); mutex_lock(&rdev->vram_mutex); -> mutex_lock(&rdev->cp.mutex);
We know r12 is the rdev struct pointer (%rdi is the first function argument gets loaded into r12), and 0x9e8 is plausable as the offset of cp.mutex, as vram_mutex is further out and plausable as 0x1ca0, so the code is trying to acquire cp.mutex it seems.
From the kernel log, the command proc ring test completed after the
start of the GPU reset, which is the last callsite to unlock cp.mutex, ie <radeon_fence_wait->radeon_gpu_reset->radeon_resume->rv770_resume->rv770_startup->r600_cp_resume->radeon_ring_test->r600_ring_test->radeon_ring_unlock_commit> before the hang at <radeon_fence_wait->radeon_gpu_reset->->drm_helper_resume_force_mode->drm_crtc_helper_set_mode->mode_fixup->atombios_crtc_mode_fixup->radeon_pm_compute_clocks->radeon_pm_set_clocks>.
Looking through all the calls between these two spots, I just can't see anything that _could_ have locked cp.mutex, unless I'm misreading the disassembly of mutex_lock.
Any thoughts?
Dan
On 8 August 2010 17:36, Daniel J Blueman daniel.blueman@gmail.com wrote:
Well, looking further back down the call chain, radeon_bo_create takes the vram_mutex across the ttm_bo_init call, so I was duped by the compiler's instruction scheduling.
I looked at various changes to the locking scheme which did address the X hang (thus blackouts), but all attempts failed with lockdep violations somewhere, so there is no minimal fix without revisiting recent locking scheme changes [1,2,3].
I'll pass on some other minor fixes I've cooked up in the meantime. Note that I am running with the kernel boot options 'dynpm=1 radeon.dynclks=1' to stop everything getting toasty; let me know if you need more info for this issue.
Thanks, Daniel
[1] http://kerneltrap.org/mailarchive/git-commits-head/2010/5/21/37917 [2] http://kerneltrap.org/mailarchive/git-commits-head/2010/5/21/37923 [3] http://kerneltrap.org/mailarchive/git-commits-head/2010/5/21/37937
Hi chaps,
On 9 August 2010 14:41, Daniel J Blueman daniel.blueman@gmail.com wrote:
I'm hitting this issue again on 2.6.36-rc1; it reproduces the 1000ms timeout and subsequent hang every time on my RV710 when starting a game on doom3 (eg http://ftp.jeuxlinux.fr/files/doom3-linux-1.1.1286-demo.x86.run), even without the dympm/dynclks options.
It looks like the locking strategy of the Matthew's patches needs reworking. I was able to reproduce the hang by forcing a GPU reset also.
What else would help with this?
Thanks, Daniel
On Tue, Aug 17, 2010 at 6:12 PM, Daniel J Blueman daniel.blueman@gmail.com wrote:
Does this patch help?
http://git.kernel.org/?p=linux/kernel/git/airlied/drm-2.6.git;a=commitdiff;h...
It fixes several other similar bugs.
Alex
On 17 August 2010 23:37, Alex Deucher alexdeucher@gmail.com wrote:
Yes, sorry for the delays replying. I still didn't get time to analyse other call-paths this can affect, and thus the possibility of a pending power mode change causing the same deadlock.
That aside, this patch does prevent the deadlocks I was able to reproduce, including when the CS parser detects issues in user-submitted CS packets.
Thanks, Daniel
dri-devel@lists.freedesktop.org