https://bugs.freedesktop.org/show_bug.cgi?id=93341
Bug ID: 93341
Summary: GPU lockups on RadeonHD 7770 (radeonsi driver) when running OpenGL games
Product: Mesa
Version: 11.0
Hardware: x86-64 (AMD64)
OS: Linux (All)
Status: NEW
Severity: major
Priority: medium
Component: Drivers/Gallium/radeonsi
Assignee: dri-devel@lists.freedesktop.org
Reporter: nekohayo@gmail.com
QA Contact: dri-devel@lists.freedesktop.org
Fedora 23, xorg-x11-drv-ati, on a Dell Precision T3500 (latest BIOS, A17) with a RadeonHD 7770 GPU. Running the latest up-to-date stock packages from Fedora.
If I start a game like Xonotic (from the Fedora repos) or Unvanquished (latest alpha binary build downloaded from their GitHub repo), after a minute or two of just looking around as a spectator player, I'll eventually see my computer's monitor turn off all of a sudden. Sound will continue to play for a while, then it might stop/loop. After a few seconds, the kernel will be locked up with the CapsLock LED no longer working.
This also happened to me once simply by watching a video fullscreen in Totem (I'm running GNOME Shell, FWIW), but this is a much rarer occurrence.
Unfortunately I don't have knowledge of debugging such things, and ABRT somehow thinks my kernel is tainted with the "I" status (meaning it's "working around a severe firmware bug"), which I suppose might be the radeon microcode, so I can't get ABRT to create a nice automated retrace/full debug thing for me. But at least it still has stuff stored on disk, if there's anything in there you'd need:
# ls -lh /var/spool/abrt/oops-2015-12-10-21:50:22-777-1/
-rw-r----- 1 root abrt 5 10 déc 21:50 abrt_version
-rw-r----- 1 root abrt 9 10 déc 21:50 analyzer
-rw-r----- 1 root abrt 6 10 déc 21:50 architecture
-rw-r----- 1 root abrt 3,7K 10 déc 21:50 backtrace
-rw-r----- 1 root abrt 124 10 déc 21:50 cmdline
-rw-r----- 1 root abrt 16 10 déc 21:50 component
-rw-r----- 1 root abrt 1 10 déc 21:50 count
-rw-r----- 1 root abrt 71K 10 déc 21:50 dmesg
-rw-r----- 1 root abrt 40 10 déc 21:50 duphash
-rw-r----- 1 root abrt 23 10 déc 21:50 extra-cc
-rw-r----- 1 root abrt 8 10 déc 21:50 hostname
-rw-r----- 1 root abrt 21 10 déc 21:50 kernel
-rw-r----- 1 root abrt 25 10 déc 21:50 kernel_tainted_long
-rw-r----- 1 root abrt 3 10 déc 21:50 kernel_tainted_short
-rw-r----- 1 root abrt 10 10 déc 21:50 last_occurrence
-rw-r----- 1 root abrt 173 10 déc 21:50 not-reportable
-rw-r----- 1 root abrt 518 10 déc 21:50 os_info
-rw-r----- 1 root abrt 32 10 déc 21:50 os_release
-rw-r----- 1 root abrt 6 10 déc 21:50 package
-rw-r----- 1 root abrt 7 10 déc 21:50 pkg_arch
-rw-r----- 1 root abrt 2 10 déc 21:50 pkg_epoch
-rw-r----- 1 root abrt 12 10 déc 21:50 pkg_name
-rw-r----- 1 root abrt 9 10 déc 21:50 pkg_release
-rw-r----- 1 root abrt 6 10 déc 21:50 pkg_version
-rw-r----- 1 root abrt 4,4K 10 déc 21:50 proc_modules
-rw-r----- 1 root abrt 37 10 déc 21:50 reason
-rw-r----- 1 root abrt 8 10 déc 21:50 runlevel
-rw-r----- 1 root abrt 269 10 déc 21:50 suspend_stats
-rw-r----- 1 root abrt 10 10 déc 21:50 time
-rw-r----- 1 root abrt 10 10 déc 21:50 type
-rw-r----- 1 root abrt 40 10 déc 21:50 uuid
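For reference, the "I" taint status mentioned above corresponds to one bit in the kernel's taint bitmask, which can be read from /proc/sys/kernel/tainted. The following is only a minimal sketch for decoding that mask: the table is deliberately partial (just a few common flags), and the helper names `decode_taint` and `read_taint_mask` are made up for illustration.

```python
# Partial table of kernel taint bits: bit number -> (letter, meaning).
# Only a handful of flags are listed; the kernel defines more.
TAINT_FLAGS = {
    0: ("P", "proprietary module loaded"),
    7: ("D", "kernel died recently (OOPS or BUG)"),
    9: ("W", "kernel issued a warning"),
    11: ("I", "working around a severe firmware bug"),
    12: ("O", "out-of-tree module loaded"),
}


def decode_taint(mask):
    """Return the (letter, meaning) pairs for every known bit set in mask."""
    return [TAINT_FLAGS[bit] for bit in sorted(TAINT_FLAGS) if mask & (1 << bit)]


def read_taint_mask(path="/proc/sys/kernel/tainted"):
    """Read the current taint bitmask from procfs (Linux only)."""
    with open(path) as f:
        return int(f.read().strip())


# Example with a fixed mask: bits 7 and 11 set, as in "Tainted: G D I".
for letter, meaning in decode_taint((1 << 7) | (1 << 11)):
    print(letter, "-", meaning)
```

On a live system, `decode_taint(read_taint_mask())` would show which of the listed flags are currently set. A mask with only bit 11 set would match the "I" status ABRT complained about.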
This is what I get in journalctl/dmesg:
-- Logs begin at lun 2015-11-30 21:48:19 EST, end at jeu 2015-12-10 23:48:33 EST. --
déc 10 21:49:00 the_PC kernel: radeon 0000:02:00.0: ring 3 stalled for more than 10115msec
déc 10 21:49:00 the_PC kernel: radeon 0000:02:00.0: GPU lockup (current fence id 0x000000000000a5fe last fence id 0x000000000000a600 on ring 3)
déc 10 21:49:01 the_PC kernel: BUG: unable to handle kernel paging request at ffffc90404239ffc
déc 10 21:49:01 the_PC kernel: IP: [<ffffffffa00f850a>] radeon_ring_backup+0xda/0x190 [radeon]
déc 10 21:49:01 the_PC kernel: PGD 6068a8067 PUD 0
déc 10 21:49:01 the_PC kernel: Oops: 0000 [#1] SMP
déc 10 21:49:01 the_PC kernel: Modules linked in: fuse xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack tun bridge stp llc ebtable
déc 10 21:49:01 the_PC kernel: radeon i2c_algo_bit drm_kms_helper ttm drm serio_raw
déc 10 21:49:01 the_PC kernel: CPU: 3 PID: 153 Comm: kworker/u64:7 Tainted: G I 4.2.6-301.fc23.x86_64 #1
déc 10 21:49:01 the_PC kernel: Hardware name: Dell Inc. Precision WorkStation T3500 /0K095G, BIOS A17 05/28/2013
déc 10 21:49:01 the_PC kernel: Workqueue: radeon-crtc radeon_flip_work_func [radeon]
déc 10 21:49:01 the_PC kernel: task: ffff88060299b880 ti: ffff8805ff5c0000 task.ti: ffff8805ff5c0000
déc 10 21:49:01 the_PC kernel: RIP: 0010:[<ffffffffa00f850a>] [<ffffffffa00f850a>] radeon_ring_backup+0xda/0x190 [radeon]
déc 10 21:49:01 the_PC kernel: RSP: 0018:ffff8805ff5c3c98 EFLAGS: 00010206
déc 10 21:49:01 the_PC kernel: RAX: ffffc9000fe50000 RBX: 00000000ffffffff RCX: 0000000000000000
déc 10 21:49:01 the_PC kernel: RDX: 0000000000000000 RSI: ffffc90404239ffc RDI: 0000000000080500
déc 10 21:49:01 the_PC kernel: RBP: ffff8805ff5c3cd8 R08: ffff8805771f8cc0 R09: 0000000000082000
déc 10 21:49:01 the_PC kernel: R10: 8000000000000163 R11: ffffffff81a609e9 R12: ffff880036a654d8
déc 10 21:49:01 the_PC kernel: R13: ffff880036a654b0 R14: 0000000000020141 R15: ffff8805ff5c3d30
déc 10 21:49:01 the_PC kernel: FS: 0000000000000000(0000) GS:ffff880606ec0000(0000) knlGS:0000000000000000
déc 10 21:49:01 the_PC kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
déc 10 21:49:01 the_PC kernel: CR2: ffffc90404239ffc CR3: 0000000001c0b000 CR4: 00000000000006e0
déc 10 21:49:01 the_PC kernel: Stack:
déc 10 21:49:01 the_PC kernel: ffff8805ff5c3cc8 ffffffffa00f9413 ffff880036a64000 ffff880036a64000
déc 10 21:49:01 the_PC kernel: ffff880036a654d8 ffff8805ff5c3d30 ffff880036a654d8 0000000000000000
déc 10 21:49:01 the_PC kernel: ffff8805ff5c3da8 ffffffffa00c6c80 ffffffff810df990 ffff880036a64738
déc 10 21:49:01 the_PC kernel: Call Trace:
déc 10 21:49:01 the_PC kernel: [<ffffffffa00f9413>] ? radeon_irq_kms_disable_hpd+0x73/0x80 [radeon]
déc 10 21:49:01 the_PC kernel: [<ffffffffa00c6c80>] radeon_gpu_reset+0xd0/0x330 [radeon]
déc 10 21:49:01 the_PC kernel: [<ffffffff810df990>] ? wake_atomic_t_function+0x70/0x70
déc 10 21:49:01 the_PC kernel: [<ffffffffa00e058f>] ? radeon_fence_wait+0x9f/0xe0 [radeon]
déc 10 21:49:01 the_PC kernel: [<ffffffffa00ed960>] radeon_flip_work_func+0x130/0x170 [radeon]
déc 10 21:49:01 the_PC kernel: [<ffffffff810b650e>] process_one_work+0x19e/0x3f0
déc 10 21:49:01 the_PC kernel: [<ffffffff810b67ae>] worker_thread+0x4e/0x450
déc 10 21:49:01 the_PC kernel: [<ffffffff810b6760>] ? process_one_work+0x3f0/0x3f0
déc 10 21:49:01 the_PC kernel: [<ffffffff810b6760>] ? process_one_work+0x3f0/0x3f0
déc 10 21:49:01 the_PC kernel: [<ffffffff810bc8b8>] kthread+0xd8/0xf0
déc 10 21:49:01 the_PC kernel: [<ffffffff810bc7e0>] ? kthread_worker_fn+0x160/0x160
déc 10 21:49:01 the_PC kernel: [<ffffffff817797df>] ret_from_fork+0x3f/0x70
déc 10 21:49:01 the_PC kernel: [<ffffffff810bc7e0>] ? kthread_worker_fn+0x160/0x160
déc 10 21:49:01 the_PC kernel: Code: 10 e1 48 85 c0 49 89 07 74 6c 41 8d 7e ff 31 d2 48 c1 e7 02 eb 07 49 8b 07 48 83 c2 04 49 8b 74 24 08 8d 4b 01 89 db 48 8d 34 9e <8b> 36 89 34 10 41 23 4c 24 54 48 39 d7 89 cb 75 da 4c 89 ef e8
déc 10 21:49:01 the_PC kernel: RIP [<ffffffffa00f850a>] radeon_ring_backup+0xda/0x190 [radeon]
déc 10 21:49:01 the_PC kernel: RSP <ffff8805ff5c3c98>
déc 10 21:49:01 the_PC kernel: CR2: ffffc90404239ffc
déc 10 21:49:01 the_PC kernel: ---[ end trace 37e2470f6b251992 ]---
déc 10 21:49:01 the_PC kernel: BUG: unable to handle kernel paging request at ffffffffffffffd8
déc 10 21:49:01 the_PC kernel: IP: [<ffffffff810bcd40>] kthread_data+0x10/0x20
déc 10 21:49:01 the_PC kernel: PGD 1c0e067 PUD 1c10067 PMD 0
déc 10 21:49:01 the_PC kernel: Oops: 0000 [#2] SMP
déc 10 21:49:01 the_PC kernel: Modules linked in: fuse xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack tun bridge stp llc ebtable
déc 10 21:49:01 the_PC kernel: radeon i2c_algo_bit drm_kms_helper ttm drm serio_raw
déc 10 21:49:01 the_PC kernel: CPU: 3 PID: 153 Comm: kworker/u64:7 Tainted: G D I 4.2.6-301.fc23.x86_64 #1
déc 10 21:49:01 the_PC kernel: Hardware name: Dell Inc. Precision WorkStation T3500 /0K095G, BIOS A17 05/28/2013
déc 10 21:49:01 the_PC kernel: task: ffff88060299b880 ti: ffff8805ff5c0000 task.ti: ffff8805ff5c0000
déc 10 21:49:01 the_PC kernel: RIP: 0010:[<ffffffff810bcd40>] [<ffffffff810bcd40>] kthread_data+0x10/0x20
déc 10 21:49:01 the_PC kernel: RSP: 0018:ffff8805ff5c3918 EFLAGS: 00010096
déc 10 21:49:01 the_PC kernel: RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000005
déc 10 21:49:01 the_PC kernel: RDX: 0000000000000005 RSI: 0000000000000003 RDI: ffff88060299b880
déc 10 21:49:01 the_PC kernel: RBP: ffff8805ff5c3918 R08: ffff88060299b910 R09: 0000000000000000
déc 10 21:49:01 the_PC kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 00000000000167c0
déc 10 21:49:01 the_PC kernel: R13: ffff88060299b880 R14: ffff880606ed67c0 R15: 0000000000000003
déc 10 21:49:01 the_PC kernel: FS: 0000000000000000(0000) GS:ffff880606ec0000(0000) knlGS:0000000000000000
déc 10 21:49:01 the_PC kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
déc 10 21:49:01 the_PC kernel: CR2: 0000000000000028 CR3: 0000000001c0b000 CR4: 00000000000006e0
déc 10 21:49:01 the_PC kernel: Stack:
déc 10 21:49:01 the_PC kernel: ffff8805ff5c3938 ffffffff810b7385 ffff8805ff5c3938 ffff880606ed67c0
déc 10 21:49:01 the_PC kernel: ffff8805ff5c3988 ffffffff81774fc0 ffff880500000000 ffff88060299b880
déc 10 21:49:01 the_PC kernel: ffff8805ff5c3988 ffff8805ff5c4000 ffff8805ff5c39f0 ffff8805ff5c39f0
déc 10 21:49:01 the_PC kernel: Call Trace:
déc 10 21:49:01 the_PC kernel: [<ffffffff810b7385>] wq_worker_sleeping+0x15/0xa0
déc 10 21:49:01 the_PC kernel: [<ffffffff81774fc0>] __schedule+0x620/0x950
déc 10 21:49:01 the_PC kernel: [<ffffffff81775327>] schedule+0x37/0x80
déc 10 21:49:01 the_PC kernel: [<ffffffff810a103a>] do_exit+0x80a/0xae0
déc 10 21:49:01 the_PC kernel: [<ffffffff810180fe>] oops_end+0x9e/0xd0
déc 10 21:49:01 the_PC kernel: [<ffffffff81064c25>] no_context+0x135/0x380
déc 10 21:49:01 the_PC kernel: [<ffffffff81064ef0>] __bad_area_nosemaphore+0x80/0x1f0
déc 10 21:49:01 the_PC kernel: [<ffffffff81065073>] bad_area_nosemaphore+0x13/0x20
déc 10 21:49:01 the_PC kernel: [<ffffffff81065357>] __do_page_fault+0xb7/0x400
déc 10 21:49:01 the_PC kernel: [<ffffffff810656cf>] do_page_fault+0x2f/0x80
déc 10 21:49:01 the_PC kernel: [<ffffffff8177b378>] page_fault+0x28/0x30
déc 10 21:49:01 the_PC kernel: [<ffffffffa00f850a>] ? radeon_ring_backup+0xda/0x190 [radeon]
déc 10 21:49:01 the_PC kernel: [<ffffffffa00f85b0>] ? radeon_ring_backup+0x180/0x190 [radeon]
déc 10 21:49:01 the_PC kernel: [<ffffffffa00f9413>] ? radeon_irq_kms_disable_hpd+0x73/0x80 [radeon]
déc 10 21:49:01 the_PC kernel: [<ffffffffa00c6c80>] radeon_gpu_reset+0xd0/0x330 [radeon]
déc 10 21:49:01 the_PC kernel: [<ffffffff810df990>] ? wake_atomic_t_function+0x70/0x70
déc 10 21:49:01 the_PC kernel: [<ffffffffa00e058f>] ? radeon_fence_wait+0x9f/0xe0 [radeon]
déc 10 21:49:01 the_PC kernel: [<ffffffffa00ed960>] radeon_flip_work_func+0x130/0x170 [radeon]
déc 10 21:49:01 the_PC kernel: [<ffffffff810b650e>] process_one_work+0x19e/0x3f0
déc 10 21:49:01 the_PC kernel: [<ffffffff810b67ae>] worker_thread+0x4e/0x450
déc 10 21:49:01 the_PC kernel: [<ffffffff810b6760>] ? process_one_work+0x3f0/0x3f0
déc 10 21:49:01 the_PC kernel: [<ffffffff810b6760>] ? process_one_work+0x3f0/0x3f0
déc 10 21:49:01 the_PC kernel: [<ffffffff810bc8b8>] kthread+0xd8/0xf0
déc 10 21:49:01 the_PC kernel: [<ffffffff810bc7e0>] ? kthread_worker_fn+0x160/0x160
déc 10 21:49:01 the_PC kernel: [<ffffffff817797df>] ret_from_fork+0x3f/0x70
déc 10 21:49:01 the_PC kernel: [<ffffffff810bc7e0>] ? kthread_worker_fn+0x160/0x160
déc 10 21:49:01 the_PC kernel: Code: c4 08 44 89 e8 5b 41 5c 41 5d 5d c3 4c 89 e7 e8 e7 eb fd ff eb 88 0f 1f 44 00 00 66 66 66 66 90 48 8b 87 90 05 00 00 55 48 89 e5 <48> 8b 40 d8 5d c3 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90
déc 10 21:49:01 the_PC kernel: RIP [<ffffffff810bcd40>] kthread_data+0x10/0x20
déc 10 21:49:01 the_PC kernel: RSP <ffff8805ff5c3918>
déc 10 21:49:01 the_PC kernel: CR2: ffffffffffffffd8
déc 10 21:49:01 the_PC kernel: ---[ end trace 37e2470f6b251993 ]---
déc 10 21:49:01 the_PC kernel: Fixing recursive fault but reboot is needed!
-- Reboot --
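To compare lockups across reports, it can help to pull the fence IDs out of the "GPU lockup" lines and see how many fences were still outstanding. Below is a minimal sketch; it assumes that "current fence id" is the last fence the GPU signaled and "last fence id" is the last one submitted, which is my reading of the message and not something confirmed in this thread. The name `parse_lockups` is made up for illustration.

```python
import re

# Matches the radeon "GPU lockup" message as it appears in the logs above.
LOCKUP_RE = re.compile(
    r"GPU lockup \(current fence id (0x[0-9a-fA-F]+) "
    r"last fence id (0x[0-9a-fA-F]+) on ring (\d+)\)"
)


def parse_lockups(log_text):
    """Return one dict per lockup message found in log_text."""
    results = []
    for match in LOCKUP_RE.finditer(log_text):
        current = int(match.group(1), 16)
        last = int(match.group(2), 16)
        results.append(
            {
                "ring": int(match.group(3)),
                "current": current,
                "last": last,
                # Fences emitted but (assumed) not yet signaled.
                "pending": last - current,
            }
        )
    return results


sample = (
    "radeon 0000:02:00.0: GPU lockup (current fence id 0x000000000000a5fe "
    "last fence id 0x000000000000a600 on ring 3)"
)
for entry in parse_lockups(sample):
    print(entry)
```

Because the regular expression does not depend on line breaks, the function can be fed a whole journalctl or dmesg dump as one string.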
--- Comment #1 from Jean-François Fortin Tam nekohayo@gmail.com --- I also get it to (rarely) lock up when not doing anything in particular. I could be just sitting and staring at my desktop when suddenly the monitor turns off and I get this in dmesg:
[67967.108746] radeon 0000:02:00.0: ring 0 stalled for more than 10252msec
[67967.108750] radeon 0000:02:00.0: GPU lockup (current fence id 0x00000000006c9132 last fence id 0x00000000006c928b on ring 0)
[67967.108772] radeon 0000:02:00.0: failed to get a new IB (-35)
[67967.108805] [drm:radeon_cs_ioctl [radeon]] *ERROR* Failed to get ib !
[67967.977163] BUG: unable to handle kernel paging request at ffffc90404239ffc
[67967.977200] IP: [<ffffffffa013736a>] radeon_ring_backup+0xda/0x190 [radeon]
[67967.977246] PGD 6068a8067 PUD 0
[67967.977271] Oops: 0000 [#1] SMP
[67967.977293] Modules linked in: fuse xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables snd_hda_codec_analog snd_hda_codec_generic dell_wmi iTCO_wdt sparse_keymap gpio_ich iTCO_vendor_support video ppdev coretemp kvm_intel dcdbas snd_hda_codec_hdmi dell_smm_hwmon kvm snd_hda_intel snd_hda_codec snd_usb_audio snd_hda_core crc32c_intel snd_usbmidi_lib snd_hwdep snd_seq snd_rawmidi snd_seq_device joydev snd_pcm snd_timer snd tpm_tis lpc_ich parport_pc i2c_i801 soundcore tpm parport wmi i7core_edac shpchp edac_core acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc hid_logitech_hidpp hid_logitech_dj wacom amdkfd amd_iommu_v2
[67967.977806] radeon i2c_algo_bit drm_kms_helper ttm tg3 serio_raw drm ptp pps_core
[67967.977875] CPU: 5 PID: 5985 Comm: Xorg Tainted: G I 4.3.3-301.fc23.x86_64 #1
[67967.977906] Hardware name: Dell Inc. Precision WorkStation T3500 /0K095G, BIOS A17 05/28/2013
[67967.977937] task: ffff8805e5a11cc0 ti: ffff8805e8038000 task.ti: ffff8805e8038000
[67967.977965] RIP: 0010:[<ffffffffa013736a>] [<ffffffffa013736a>] radeon_ring_backup+0xda/0x190 [radeon]
[67967.978013] RSP: 0018:ffff8805e803ba28 EFLAGS: 00010206
[67967.978033] RAX: ffffc9000c001000 RBX: 00000000ffffffff RCX: 0000000000000000
[67967.978059] RDX: 0000000000000000 RSI: ffffc90404239ffc RDI: 00000000000b0bc0
[67967.978086] RBP: ffff8805e803ba58 R08: ffff8803c68b3880 R09: 00000000000b2000
[67967.978112] R10: 8000000000000163 R11: ffffffff81a68139 R12: ffff8805ff2a54d8
[67967.978138] R13: ffff8805ff2a54b0 R14: 000000000002c2f1 R15: ffff8805e803baa0
[67967.978164] FS: 00007f5fb263f700(0000) GS:ffff880606f40000(0000) knlGS:0000000000000000
[67967.978194] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[67967.978215] CR2: ffffc90404239ffc CR3: 00000005e5b48000 CR4: 00000000000006e0
[67967.978241] Stack:
[67967.978250] ffff8805ff2a4000 ffff8805ff2a4000 ffff8805ff2a54d8 ffff8805e803baa0
[67967.978287] ffff8805ff2a54d8 0000000000000000 ffff8805e803bb10 ffffffffa0105c8d
[67967.978322] ffff8805ff2a4738 00ffffff00000001 ffff8805ff2a4018 0000000000000000
[67967.978359] Call Trace:
[67967.978377] [<ffffffffa0105c8d>] radeon_gpu_reset+0xcd/0x330 [radeon]
[67967.978415] [<ffffffffa01dec7f>] ? radeon_sync_free+0x2f/0x40 [radeon]
[67967.978452] [<ffffffffa01de547>] ? radeon_ib_free+0x37/0x40 [radeon]
[67967.978488] [<ffffffffa0138df4>] radeon_cs_ioctl+0x64/0x780 [radeon]
[67967.978520] [<ffffffffa0019408>] drm_ioctl+0x138/0x500 [drm]
[67967.978552] [<ffffffffa0138d90>] ? radeon_cs_parser_init+0x490/0x490 [radeon]
[67967.978586] [<ffffffff8178108e>] ? _raw_spin_unlock_irqrestore+0xe/0x10
[67967.978618] [<ffffffffa010304c>] radeon_drm_ioctl+0x4c/0x80 [radeon]
[67967.978647] [<ffffffff81236bd5>] do_vfs_ioctl+0x295/0x470
[67967.978671] [<ffffffff8111e941>] ? SyS_futex+0x81/0x180
[67967.978692] [<ffffffff81236e29>] SyS_ioctl+0x79/0x90
[67967.978712] [<ffffffff817815ee>] entry_SYSCALL_64_fastpath+0x12/0x71
[67967.978735] Code: 0c e1 48 85 c0 49 89 07 74 6c 41 8d 7e ff 31 d2 48 c1 e7 02 eb 07 49 8b 07 48 83 c2 04 49 8b 74 24 08 8d 4b 01 89 db 48 8d 34 9e <8b> 36 89 34 10 41 23 4c 24 54 48 39 d7 89 cb 75 da 4c 89 ef e8
[67967.979054] RIP [<ffffffffa013736a>] radeon_ring_backup+0xda/0x190 [radeon]
[67967.979094] RSP <ffff8805e803ba28>
[67967.979108] CR2: ffffc90404239ffc
[67967.988284] ---[ end trace f6fe8c1dbb2ed43c ]---
[68043.714679] Chrome_ChildThr[29558]: segfault at 0 ip 0000557f813adea4 sp 00007fa9867fe3e0 error 6 in plugin-container[557f813a5000+3d000]
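A side note on the register dumps: both oopses fault at the same address (CR2 = ffffc90404239ffc) with RBX = 00000000ffffffff. Reading the "Code:" bytes by hand, the instruction just before the faulting load looks like `lea rsi,[rsi+rbx*4]`, i.e. the fault address would be a base pointer plus a 32-bit index times 4. That disassembly and the interpretation of RBX as a ring read pointer are my assumptions, not something a developer confirmed here. Under those assumptions, the arithmetic can be sketched as:

```python
def base_from_fault(fault_addr, index, elem_size=4):
    """Undo 'base + index * elem_size' to recover the base pointer."""
    return fault_addr - index * elem_size


# Values copied from the register dumps in this report.
cr2 = 0xFFFFC90404239FFC  # faulting address (CR2 / RSI)
rbx = 0x00000000FFFFFFFF  # index register at the time of the fault

base = base_from_fault(cr2, rbx)
offset = cr2 - base

print(hex(base))    # 0xffffc9000423a000
print(hex(offset))  # 0x3fffffffc
```

The recovered base is page-aligned and the offset is just under 16 GiB, which would be consistent with the driver indexing a buffer with an all-ones value (0xffffffff) read back from a GPU that had stopped responding. That is a hypothesis to check against the driver source, not a diagnosis.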
--- Comment #2 from Andreas Kilgus kilgus@fuenfsieben.de --- Created attachment 121831 --> https://bugs.freedesktop.org/attachment.cgi?id=121831&action=edit Excerpt /var/log/messages GPU crash
--- Comment #3 from Andreas Kilgus kilgus@fuenfsieben.de --- Happens at low system/graphical load, maybe related to chromium (IIRC, the last two times it occurred I was actively using chromium).
Radeon R7 260X
Mesa 11.1.2
xorg-x11-server 7.6 1.18.1
kernel 4.4.1
Jean-François Fortin Tam nekohayo@gmail.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|GPU lockups on RadeonHD     |GPU lockups on RadeonHD
                   |7770 (radeonsi driver) when |7770 (radeonsi driver) when
                   |running OpenGL games        |running OpenGL games or
                   |                            |after extended periods of
                   |                            |time
--- Comment #4 from Jean-François Fortin Tam nekohayo@gmail.com --- Happened to me again today after 1 day and 22 hours of uptime, with the computer just sitting around, idle, with the screen turned off. It can sometimes happen after 6 days, sometimes 1-2 days... doesn't matter what you're doing or not.
At least this time I've been able to eliminate "suspend/resume" from the list of potential causes, as the computer was set to never sleep.
--- Comment #5 from Jean-François Fortin Tam nekohayo@gmail.com --- And it's not triggered by Chromium/Epiphany/Firefox, it happens with just a GNOME desktop sitting around in my case. Clearly, something is just FUBAR in the radeon driver or recent Linux kernels...
--- Comment #6 from Jean-François Fortin Tam nekohayo@gmail.com --- For what it's worth, compared to my previous comment #5, tonight (same machine, same distro/stack) I was able to trigger the bug pretty frequently by using the Epiphany browser with a particular website: twice within the span of fifteen minutes or so.
So while there is a simple time component (e.g. crashes while the computer isn't doing anything in particular), it can also sometimes be triggered by stressing the graphics card a little with some operations (such as can be seen in some browsers).
Jean-François Fortin Tam nekohayo@gmail.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                URL|                            |https://bugzilla.redhat.com
                   |                            |/show_bug.cgi?id=1335360
--- Comment #7 from Jean-François Fortin Tam nekohayo@gmail.com --- Um hello, any developers around?
As previously mentioned, although it happens even when idle, it's quite easy to trigger and reproduce by using 3D/OpenGL content. And it's extremely easy to trigger with http://demo.f4map.com/#lat=45.4946369&lon=-73.5661827&zoom=19 ; you just have to sit around on that page for a minute or two, maybe pan around the map, and your driver (and kernel) will crash with the screen turning off.
--- Comment #8 from Nicolai Hähnle nhaehnle@gmail.com --- Sorry for your troubles. Non-deterministic lockups are just very hard to debug, and silence mostly means that nobody has an idea.
For future record, which browser reproduces the lockup for you on that website?
--- Comment #9 from Jean-François Fortin Tam nekohayo@gmail.com --- Hi Nicolai, it's more the lack of response after half a year that bothered me. I was really looking forward to providing any information that might be needed to investigate this bug, but trying to work for six months with a workstation that can hard-lock at any time is really painful :)
I can see now that it is a somewhat non-deterministic bug indeed. I have been using the latest version of Firefox (v47+) on Fedora 23 and 24 today to trigger the bug easily (usually within 3-10 minutes) by having these pages open all at the same time (what better torture test than a bunch of WebGL demos!):
- appear.in/fdo93341
- demo.f4map.com
- bongiovi.tw/projects/particlesValley/
- jayweeks.com/medusae/
...with a RadeonHD 2600 (instead of the 7770) the bug does not occur so far, but that's a completely different series (r600 instead of radeonsi) so I'm not surprised.
FWIW, this Dell workstation-class computer has a pretty powerful PSU (525w) compared to the one of the previous computer I was on with the Radeon 7770 (which had a 350w PSU). I measured the GPU's temperatures at all times (nothing unusual going on), tried different PCI-E slots (since my workstation has two), no luck...
Nicolai Hähnle nhaehnle@gmail.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |deathsimple@vodafone.de
--- Comment #10 from Nicolai Hähnle nhaehnle@gmail.com --- I've been running the last three in Firefox on a Tonga system that was simultaneously used for other tests for 45 minutes now, without a hang.
@Christian: It's a long shot, but by the rough shape of GPU lockup reports over the last few months I have the impression that the radeon module still has a lockup bug under pressure (especially with multiple apps running simultaneously, but that might just be X/the compositor) which was fixed in amdgpu. Any idea what that might have been?
--- Comment #11 from Jean-François Fortin Tam nekohayo@gmail.com --- You are right Nicolai, the stressor to trigger the bug is more subtle than I thought after all... while I was able to trigger this within minutes a few days ago, now my machine has been running with those 3-4 WebGL benchmarks for the entire day today without issues.
Just to make sure it's really not a hardware issue, I tried with different power supplies, I measured the consumption (the machine eats between 150 and 220 watts at the very maximum, whereas the PSU can easily supply 500 watts), and tested the "Other OS", which doesn't exhibit the issue... so it does still look like a software bug, at least. I'd be happy to provide any other info you may need.
--- Comment #12 from Arek Ruśniak arek.rusi@gmail.com --- Hi, I have an HD 7770 too and your problem sounds familiar. I use gnome3 as well. I use mesa/llvm from the git master tree all the time. Sometimes clicking on "Activities" was enough for the GPU to go "bunga bunga", but sometimes it was stable as hell. It was extremely random and I found no trigger for it, but some time ago (I don't remember exactly, half a year or so) the problem disappeared.
If you are still using mesa from Fedora (I can't see which version it is), maybe it's time to consider a change. There is a repo with mesa-git for Fedora (built against llvm 3.8). It could be a good start.
Jean-François Fortin Tam nekohayo@gmail.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|GPU lockups on RadeonHD     |GPU lockups on RadeonHD
                   |7770 (radeonsi driver) when |7770 (radeonsi driver) when
                   |running OpenGL games or     |running OpenGL games, WebGL
                   |after extended periods of   |apps, or after extended
                   |time                        |periods of time
--- Comment #13 from Jean-François Fortin Tam nekohayo@gmail.com --- Just to be 110% sure: I put a completely new, top-quality 650w power supply into the machine, and the problem persists with the F4 map WebGL demo.
Jean-François Fortin Tam nekohayo@gmail.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|GPU lockups on RadeonHD     |Semi-random GPU lockups on
                   |7770 (radeonsi driver) when |radeonsi with a RadeonHD
                   |running OpenGL games, WebGL |7770 (when playing videos,
                   |apps, or after extended     |running OpenGL games, WebGL
                   |periods of time             |apps, or after extended
                   |                            |periods of time)
--- Comment #14 from Jean-François Fortin Tam nekohayo@gmail.com --- As an update/additional info: the problem persists on Fedora 25 running a Wayland-based GNOME. I don't know how to determine the driver's version number but I presume it to be the latest released at this time.
--- Comment #15 from Vedran Miletić vedran@miletic.net --- (In reply to Jean-François Fortin Tam from comment #14)
> As an update/additional info: the problem persists on Fedora 25 running a Wayland-based GNOME. I don't know how to determine the driver's version number but I presume it to be the latest released at this time.
Do you get the same dmesg errors? I have Wayland locking up randomly, but dmesg stays clean and I can ssh into the machine and reboot.
--- Comment #16 from Jean-François Fortin Tam nekohayo@gmail.com --- Created attachment 128278 --> https://bugs.freedesktop.org/attachment.cgi?id=128278&action=edit journalctl output at the time of a deadlock on F25
> Do you get the same dmesg errors? I have Wayland locking up randomly, but dmesg stays clean and I can ssh into the machine and reboot.
Pretty much yeah. Attached is the crash I have experienced just now, and the computer wasn't doing anything other than sitting around on the desktop and playing music from Rhythmbox... and you can see the usual:
/usr/libexec/gdm-x-session[18145]: radeon: Failed to deallocate virtual address for buffer:
/usr/libexec/gdm-x-session[18145]: radeon: size : 20480 bytes
kernel: radeon 0000:02:00.0: ring 3 stalled for more than 10083msec
kernel: radeon 0000:02:00.0: GPU lockup (current fence id 0x00000000002d46ee last fence id 0x00000000002d4710 on ring 3)
kernel: radeon 0000:02:00.0: failed to get a new IB (-35)
kernel: [drm:radeon_gem_va_ioctl [radeon]] *ERROR* Couldn't update BO_VA (-35)
kernel: radeon 0000:02:00.0: failed to get a new IB (-35)
kernel: radeon 0000:02:00.0: failed to get a new IB (-35)
kernel: [drm:radeon_cs_ioctl [radeon]] *ERROR* Failed to get ib !
kernel: [drm:radeon_gem_va_ioctl [radeon]] *ERROR* Couldn't update BO_VA (-35)
/usr/libexec/gdm-x-session[18145]: radeon: va : 0x1f836000
/usr/libexec/gdm-x-session[18145]: radeon: Failed to deallocate virtual address for buffer:
/usr/libexec/gdm-x-session[18145]: radeon: size : 45056 bytes
/usr/libexec/gdm-x-session[18145]: radeon: va : 0x1f4b0000
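The "(-35)" in those kernel messages is a negated errno value. On Linux, errno 35 is EDEADLK ("Resource deadlock avoided"); my assumption is that the radeon driver returns it from these paths once it has detected the lockup, so these lines would be a symptom of the stall rather than a separate failure. Here is a small sketch that annotates such codes in a log line. The lookup table is intentionally tiny and hard-coded to the Linux value, and `annotate_errnos` is a made-up helper name.

```python
import re

# Linux errno values only; other platforms number these differently.
LINUX_ERRNO_NAMES = {35: "EDEADLK"}

# A negative integer in parentheses, e.g. "(-35)".
ERR_RE = re.compile(r"\((-\d+)\)")


def annotate_errnos(line):
    """Return (code, name) pairs for every '(-N)' found in a log line."""
    pairs = []
    for match in ERR_RE.finditer(line):
        code = int(match.group(1))
        pairs.append((code, LINUX_ERRNO_NAMES.get(-code, "unknown")))
    return pairs


line = "kernel: radeon 0000:02:00.0: failed to get a new IB (-35)"
print(annotate_errnos(line))
```

Values such as "(0000)" in the register dumps are not matched, because the pattern requires a leading minus sign.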
--- Comment #17 from Jean-François Fortin Tam nekohayo@gmail.com --- I'm unfortunately still seeing this on an up-to-date Fedora 25 with kernel 4.9.6, DRM 2.48.0, LLVM 3.8.1, mesa 13.0.3, xorg-x11-drv-ati 7.7.1 (2016-09-28 git 3fc839ff) etc.
Nicolai, would it help at all to know that I don't recall ever encountering the issue while playing non-fullscreened HTML5 youtube videos in Firefox, but that I can easily encounter it if playing fullscreen or if playing fullscreen videos in Totem (under GNOME Shell, whether Xorg or Wayland session)?
This really doesn't seem related to system load, I was looking at "radeontop" just now while playing a fullscreen video (which made it deadlock within a few minutes) and the graphics pipe was barely 20-30% used, and VRAM about 80-90% used but never 100%.
--- Comment #18 from Michel Dänzer michel@daenzer.net --- Please attach the current Xorg log file.
--- Comment #19 from Jean-François Fortin Tam nekohayo@gmail.com --- What would be the equivalent in the systemd/journalctl world? Apparently Fedora 25 doesn't generate Xorg.log files anymore; the last modification timestamp on that one file is October 10th, 2016...
--- Comment #20 from Alex Deucher alexdeucher@gmail.com --- (In reply to Jean-François Fortin Tam from comment #19)
> What would be the equivalent in the systemd/journalctl world? Apparently Fedora 25 doesn't generate Xorg.log files anymore; the last modification timestamp on that one file is October 10th, 2016...
See this page for how to access the xorg log output on various versions of fedora: https://fedoraproject.org/wiki/How_to_debug_Xorg_problems
--- Comment #21 from Jean-François Fortin Tam nekohayo@gmail.com --- Created attachment 129304 --> https://bugs.freedesktop.org/attachment.cgi?id=129304&action=edit journalctl output at the time of a deadlock on F25 - X GDM session output only
Hi Alex and thanks for the pointer, here's the output as per those instructions... but the result seems quite useless compared to the full journalctl output (which I'll be attaching as well).
Jean-François Fortin Tam nekohayo@gmail.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #128278|0                           |1
        is obsolete|                            |
--- Comment #22 from Jean-François Fortin Tam nekohayo@gmail.com --- Created attachment 129305 --> https://bugs.freedesktop.org/attachment.cgi?id=129305&action=edit journalctl output at the time of a deadlock on F25 - take 2
Full journal output at the time of the crash. Exactly the same as before as far as I can tell. If there's any other information I can provide, please tell.
--- Comment #23 from Jean-François Fortin Tam nekohayo@gmail.com --- Created attachment 129306 --> https://bugs.freedesktop.org/attachment.cgi?id=129306&action=edit Xorg log
Xorg.0.log file found in ~/.local/share/xorg as "Xorg.0.log.old". As you can see, it says nothing about the crash. It seems only the global journalctl output caught something.
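For anyone else hunting for the file: with a per-user X session the log ends up under ~/.local/share/xorg (as found above), while a system-wide X server traditionally writes to /var/log. The following is a minimal sketch that lists both candidate locations and picks the most recently modified file that exists. The function names and the `system_dir` parameter are made up for illustration, and other distributions may use other paths.

```python
from pathlib import Path


def xorg_log_candidates(home=None, display=0, system_dir="/var/log"):
    """List the usual Xorg log locations, current and '.old' variants."""
    home = Path(home) if home is not None else Path.home()
    names = [f"Xorg.{display}.log", f"Xorg.{display}.log.old"]
    dirs = [home / ".local" / "share" / "xorg", Path(system_dir)]
    return [d / name for d in dirs for name in names]


def newest_existing(paths):
    """Return the most recently modified existing file, or None."""
    existing = [p for p in paths if p.is_file()]
    return max(existing, key=lambda p: p.stat().st_mtime, default=None)


if __name__ == "__main__":
    print(newest_existing(xorg_log_candidates()))
```

After a crash and reboot, the ".old" file is usually the one that covers the session that locked up, so the newest file is not necessarily the interesting one.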
Michel Dänzer michel@daenzer.net changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #129306|text/x-log                  |text/plain
          mime type|                            |
--- Comment #24 from Julien Isorce julien.isorce@gmail.com --- Does the test

wget http://www.phoronix-test-suite.com/benchmark-files/GpuTest_Linux_x64_0.7.0.z...
DISPLAY=:0 ./GpuTest /test=fur /fullscreen

reproduce the problem?
--- Comment #25 from Jean-François Fortin Tam nekohayo@gmail.com --- Hi Julien, unfortunately with that benchmark I was not able to reproduce it so far (I've had it running for about 9 hours). This might be just "luck" though, as I've sometimes had the issue refuse to reproduce for hours and days, and sometimes the issue would happen right away. As I'm suspecting it's a race condition, I'm thinking it might also be sensitive to the system's software collection at various times of the year (i.e. maybe with one kernel the problem resurfaces more frequently, then another point kernel release changes the stack's timings a bit and the race disappears, rinse & repeat?)
I might as well leave the benchmark running in the coming days, but at least you know that it's (probably) not directly due to the system load or the GPU load... as I mentioned in earlier comments, it seems to be quite random. For some reason, I haven't encountered a random lockup in a month, although I've grown to use my computer in light ways (too scared to play fullscreen videos or use 3D, except composited window managers)
--- Comment #26 from Jean-François Fortin Tam nekohayo@gmail.com --- OK, I've got good news... Julien, thanks to the crazy furry donut "torture test" you suggested, I was able to finally pinpoint the real trigger for this bug.
My understanding is that on Radeons (well, at least the Radeon HD 7770), there is an emergency mechanism in the hardware (or maybe in the firmware/microcode) that makes the GPU throttle its own performance when it reaches a critical temperature. Normally, the video driver is supposed to handle this state change gracefully. However, the radeonsi/radeon/amdgpu driver on Linux does not, so the kernel panics because the driver went belly up.
During additional testing today, where I forced my GPU to overheat, I was able to determine that the critical point is the same as on Windows: 113 degrees Celsius. As soon as you go over 112... boom, dead radeonsi driver + kernel oops (with the same error messages as my previous logs above). Additionally, lm_sensors thinks the temperature has instantly jumped to 511 degrees Celsius (!), and the readings stay stuck at 511 Celsius.
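For anyone who wants to watch for the same threshold, here is a minimal sketch of a temperature check. It assumes the GPU sensor is exposed through the kernel's hwmon sysfs interface (`temp*_input` files, in millidegrees Celsius); the function names and the three labels are made up for illustration. The 113 and 511 values are simply the ones observed above. As a side note, 511 is 0x1FF, the largest 9-bit value, which might hint at a saturated or all-ones register read, but that is only speculation.

```python
from pathlib import Path

CRITICAL_C = 113      # critical temperature observed in this report
IMPLAUSIBLE_C = 511   # stuck reading seen in lm_sensors after the lockup


def classify_temp(millidegrees: int) -> str:
    """Classify a hwmon reading given in millidegrees Celsius."""
    celsius = millidegrees / 1000
    if celsius >= IMPLAUSIBLE_C:
        return "implausible"
    if celsius >= CRITICAL_C:
        return "critical"
    return "ok"


def read_hwmon_temps(root: str = "/sys/class/hwmon"):
    """Yield (path, millidegrees) for every readable temp*_input file."""
    for f in sorted(Path(root).glob("hwmon*/temp*_input")):
        try:
            yield str(f), int(f.read_text().strip())
        except (OSError, ValueError):
            # Sensor missing or unreadable: skip it rather than abort.
            continue


if __name__ == "__main__":
    for path, value in read_hwmon_temps():
        print(path, value / 1000, classify_temp(value))
```

Running it in a loop while a stress test is active should show whether the reading approaches the critical value before the lockup.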
"Duh! Just get better cooling!" might sound like a workaround (just like keeping the case open), but nope, technically, it's still a software/driver issue: the Linux driver should handle such scenarios just as gracefully as the Windows driver. In Windows, breaching the 110-113 degrees Celsius limit results in the video driver simply dropping frames massively and continuing to function at reduced performance (i.e. going from 40-60 fps to 10-15 fps on one of my benchmarks). The system never crashes.
So the bug here, as I understand it, is that the radeonsi driver on Linux does not handle the event where the hardware force-throttles itself.
---------
Contextual notes:

The reason why I only started experiencing this issue in December 2015 (even though I've had the GPU since 2012) is that I changed my PC case then, which means a different airflow and cooling behavior.

The reason why it was so hard to get consistent crashes is that, while troubleshooting, I was sometimes testing with the case closed and sometimes with the case open (when trying a different power supply unit using a "siamese transplant" across another computer, for example). If I keep my case open, the card never reaches the critical temperature, so the issue does not happen.

I might get a system "freeze" (possibly saying "*ERROR* si_restrict_performance_levels_before_switch failed") after many hours of torture testing, but the symptoms are different (the screen does not turn off, the image stays on with everything frozen, and there is nothing else in the logs), so I presume that to be a different issue.
--- Comment #27 from Julien Isorce julien.isorce@gmail.com --- About your comment #26, do you get logs similar to those attached? i.e. ring N stalled, then gpu softreset, then a freeze which requires a reboot?
Can you try https://bugs.freedesktop.org/show_bug.cgi?id=100712#c6 ?
--- Comment #28 from Jean-François Fortin Tam nekohayo@gmail.com --- Hi Julien, sorry I missed the mail notification in the pile. To answer your question:
> About your comment #26, do you get logs similar to those attached? i.e. ring N stalled, then gpu softreset, then a freeze which requires a reboot?
Yeah I was getting the exact same output as usual (forgot to mention that).
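In case it helps others compare their own logs, here is a rough sketch of a scanner for that sequence. The patterns are taken only from the wording used in this thread ("ring N stalled", "gpu softreset", and the `si_restrict_performance_levels_before_switch failed` error); the exact kernel message text may differ, and the `scan_log` name and pattern labels are made up for illustration.

```python
import re
import sys

# Patterns based on the wording in this thread; real kernel messages may differ.
PATTERNS = {
    "ring_stalled": re.compile(r"ring \d+ stalled", re.IGNORECASE),
    "gpu_softreset": re.compile(r"gpu softreset", re.IGNORECASE),
    "restrict_levels_failed": re.compile(
        r"si_restrict_performance_levels_before_switch failed"
    ),
}


def scan_log(lines):
    """Return the names of matched patterns, in order of first appearance."""
    seen = []
    for line in lines:
        for name, pattern in PATTERNS.items():
            if name not in seen and pattern.search(line):
                seen.append(name)
    return seen


if __name__ == "__main__":
    # Reads log text from standard input and prints what was found.
    print(scan_log(sys.stdin))
```

It could be fed the kernel messages of the previous boot, for example with `journalctl -k -b -1 | python3 scan.py` (the script file name is arbitrary).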
> Can you try https://bugs.freedesktop.org/show_bug.cgi?id=100712#c6 ?
Not easily as I'd have to wait for that to trickle down into whatever kernel Fedora is packaging and compare versions, and would need to be able to make my GPU overheat which is no longer easy since I completely changed the thermal design and ventilation of my case (even under 100% GPU load it stays under 60-70 Celsius now).
Though maybe Andreas or Arek could also try this, if they have a similar issue with an "open air" GPU fan design that exhausts into a not-so-well-ventilated case (instead of a "blower" GPU cooler that directly extracts the hot air)...
GitLab Migration User gitlab-migration@fdo.invalid changed:
              What|Removed                     |Added
----------------------------------------------------------------------------
            Status|NEW                         |RESOLVED
        Resolution|---                         |MOVED
--- Comment #29 from GitLab Migration User gitlab-migration@fdo.invalid --- -- GitLab Migration Automatic Message --
This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.
You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1226.