https://bugs.freedesktop.org/show_bug.cgi?id=104299
Bug ID: 104299
Summary: Crash on amdgpu_sync_get_fence
Product: DRI
Version: XOrg git
Hardware: Other
OS: All
Status: NEW
Severity: normal
Priority: medium
Component: DRM/AMDgpu
Assignee: dri-devel@lists.freedesktop.org
Reporter: higuita@gmx.net
During the past week I got 2 amdgpu crashes, both with this stack:
Dec 17 02:54:42 Couracado kernel: [69955.112339] Oops: 0000 [#1] SMP
Dec 17 02:54:42 Couracado kernel: [69955.138598] Modules linked in: uinput snd_usb_audio snd_usbmidi_lib snd_rawmidi f71882fg ipt_ECN snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss nf_conntrack_ipv6 nf_defrag_ipv6 ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 ip6table_mangle ip6table_filter ip6_tables xt_DSCP nf_nat_irc nf_nat nf_conntrack_irc nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack nf_log_ipv4 nf_log_common xt_LOG xt_limit ipt_REJECT nf_reject_ipv4 xt_tcpudp iptable_mangle iptable_filter ip_tables x_tables bridge stp llc ipv6 nls_iso8859_1 nls_cp437 vfat fat reiserfs sch_fq_codel pcspkr fuse joydev hid_generic snd_hda_codec_hdmi usbhid hid eeepc_wmi tuner_simple tuner_types tea5767 tuner tda7432 snd_hda_codec_realtek tvaudio snd_hda_codec_generic msp3400 snd_hda_intel snd_hda_codec
Dec 17 02:54:42 Couracado kernel: [69955.735663] asus_wmi snd_hwdep sparse_keymap bttv tea575x snd_hda_core i2c_dev rfkill wmi_bmof tveeprom crct10dif_pclmul snd_pcm videobuf_dma_sg videobuf_core amdkfd crc32_pclmul rc_core evdev efi_pstore crc32c_intel r8169 v4l2_common ghash_clmulni_intel amd_iommu_v2 serio_raw efivars fam15h_power k10temp snd_timer mii ohci_pci videodev i2c_piix4 snd amdgpu ehci_pci soundcore ohci_hcd ehci_hcd mfd_core parport_pc hwmon xhci_pci ttm parport wmi xhci_hcd video shpchp button acpi_cpufreq loop
Dec 17 02:54:42 Couracado kernel: [69956.099719] CPU: 1 PID: 814 Comm: gfx Not tainted 4.14.6-slack #6
Dec 17 02:54:42 Couracado kernel: [69956.150725] Hardware name: System manufacturer System Product Name/A88X-PLUS, BIOS 3003 03/10/2016
Dec 17 02:54:42 Couracado kernel: [69956.225762] task: ffff884c3d508100 task.stack: ffffb665439b0000
Dec 17 02:54:42 Couracado kernel: [69956.275368] RIP: 0010:amdgpu_sync_get_fence+0x91/0xe0 [amdgpu]
Dec 17 02:54:42 Couracado kernel: [69956.324197] RSP: 0018:ffffb665439b3e20 EFLAGS: 00010246
Dec 17 02:54:42 Couracado kernel: [69956.367931] RAX: 00000000002ae450 RBX: ffff884ab449db60 RCX: 0000000000000000
Dec 17 02:54:42 Couracado kernel: [69956.427677] RDX: 0000000000000064 RSI: ffff884b534e8540 RDI: ffff884c46000e00
Dec 17 02:54:42 Couracado kernel: [69956.487426] RBP: ffffb665439b3e40 R08: 0000000000000008 R09: 0000000000000010
Dec 17 02:54:42 Couracado kernel: [69956.547172] R10: 0000000000000255 R11: 000000000000019f R12: 0000000000000000
Dec 17 02:54:42 Couracado kernel: [69956.606922] R13: ffff884767dbc900 R14: ffff884767dbc968 R15: ffff8848d44b8bd8
Dec 17 02:54:42 Couracado kernel: [69956.666669] FS: 0000000000000000(0000) GS:ffff884c5ec80000(0000) knlGS:0000000000000000
Dec 17 02:54:42 Couracado kernel: [69956.734426] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 17 02:54:42 Couracado kernel: [69956.782525] CR2: 00000000002ae468 CR3: 000000011da6a000 CR4: 00000000000406e0
Dec 17 02:54:42 Couracado kernel: [69956.842274] Call Trace:
Dec 17 02:54:42 Couracado kernel: [69956.862764] amdgpu_job_dependency+0x93/0x100 [amdgpu]
Dec 17 02:54:42 Couracado kernel: [69956.905816] amd_sched_main+0xb5/0x450 [amdgpu]
Dec 17 02:54:42 Couracado kernel: [69956.943730] ? wait_woken+0x80/0x80
Dec 17 02:54:42 Couracado kernel: [69956.972902] kthread+0x125/0x140
Dec 17 02:54:42 Couracado kernel: [69956.999935] ? amd_sched_process_job+0xc0/0xc0 [amdgpu]
Dec 17 02:54:42 Couracado kernel: [69957.043674] ? kthread_create_on_node+0x70/0x70
Dec 17 02:54:42 Couracado kernel: [69957.081583] ret_from_fork+0x22/0x30
Dec 17 02:54:42 Couracado kernel: [69957.111479] Code: 89 44 24 08 48 c7 06 00 00 00 00 48 c7 46 08 00 00 00 00 48 8b 3d d8 47 15 00 e8 ab 94 d3 da 48 8b 43 48 a8 01 75 9b 48 8b 43 08 <48> 8b 40 18 48 85 c0 74 09 48 89 df ff d0 84 c0 75 0c 48 89 d8
Dec 17 02:54:42 Couracado kernel: [69957.330761] CR2: 00000000002ae468
Dec 17 02:54:42 Couracado kernel: [69957.358479] ---[ end trace da8374d3133f4c24 ]---
Dec 17 02:54:42 Couracado kernel: [69957.397138] sched: RT throttling activated
It is rare, so hard to reproduce, but as amdgpu has been stable for me for the last 6 months, I would say it's something in the latest kernel or mesa code. I'm using kernel 4.14.6, drm 2.4.88, mesa 17.3.0, llvm 5.0.0.
thanks
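As a reading aid for the oops above, the register dump can be cross-checked against the marked instruction bytes. The following is a minimal sketch, not part of the original report: the helper names `register` and `faulting_bytes` are made up for illustration, and the log excerpt is copied from the trace. It shows that the faulting address in CR2 is RAX plus 0x18, and that the bytes at the `<48>` marker (`48 8b 40 18`, which on x86-64 encodes a load from RAX plus an 8-bit displacement of 0x18) carry the same displacement. That would be consistent with a load through a bogus pointer held in RAX, though it does not by itself say where that pointer came from.

```python
import re

# Excerpt copied from the oops in this report (Code line shortened).
OOPS = """\
RAX: 00000000002ae450 RBX: ffff884ab449db60 RCX: 0000000000000000
CR2: 00000000002ae468 CR3: 000000011da6a000 CR4: 00000000000406e0
Code: 48 8b 43 48 a8 01 75 9b 48 8b 43 08 <48> 8b 40 18 48 85 c0 74 09
"""


def register(name, text):
    """Return the value of a register printed as 'NAME: <hex>' (hypothetical helper)."""
    match = re.search(rf"\b{name}: ([0-9a-f]+)", text)
    if match is None:
        raise ValueError(f"register {name} not found")
    return int(match.group(1), 16)


def faulting_bytes(text, count=4):
    """Return `count` opcode bytes starting at the <xx> marker (hypothetical helper)."""
    code = re.search(r"Code: (.*)", text).group(1).split()
    start = next(i for i, byte in enumerate(code) if byte.startswith("<"))
    return [int(byte.strip("<>"), 16) for byte in code[start:start + count]]


rax = register("RAX", OOPS)
cr2 = register("CR2", OOPS)
insn = faulting_bytes(OOPS)

# Distance between the faulting address and RAX.
print(hex(cr2 - rax))
# Bytes of the instruction that faulted; the last one is the displacement.
print([hex(byte) for byte in insn])
```

The same check can be applied to any oops of this shape by pasting the relevant register and Code lines into `OOPS`.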
--- Comment #1 from Christian König ckoenig.leichtzumerken@gmail.com ---
Please add the full dmesg output as an attachment.
--- Comment #2 from Andrey Grodzovsky andrey.grodzovsky@amd.com ---
Hi, have you noticed any specific scenario under which those crashes happened?
Thanks, Andrey
--- Comment #3 from higuita@gmx.net ---
Well, both times it happened while playing rimworld, but I didn't notice any specific action that triggers it.
My hardware is an A10-7850k and an RX480, slackware64-current, dual head 1920x1080, with steam+rimworld running on one head and tvtime running on the other.
I do usually suspend my machine; I do not know if this also helps trigger it.
--- Comment #4 from higuita@gmx.net ---
Created attachment 136266
  --> https://bugs.freedesktop.org/attachment.cgi?id=136266&action=edit
dmesg without the crash
This is my current dmesg
--- Comment #5 from higuita@gmx.net ---
Created attachment 136267
  --> https://bugs.freedesktop.org/attachment.cgi?id=136267&action=edit
syslog capture for the oops
I didn't save the dmesg directly, but I could salvage this from the syslog.
--- Comment #6 from Andrey Grodzovsky andrey.grodzovsky@amd.com ---
Can you try to reproduce it with KASAN enabled?
Thanks, Andrey
--- Comment #7 from higuita@gmx.net ---
Created attachment 136398
  --> https://bugs.freedesktop.org/attachment.cgi?id=136398&action=edit
dmesg oops with kasan
Sure, here is the dmesg after a crash with KASAN, this time in warthunder.
--- Comment #8 from Andrey Grodzovsky andrey.grodzovsky@amd.com ---
(In reply to higuita from comment #7)
> Created attachment 136398 [details]
> dmesg oops with kasan
>
> Sure, there is the dmesg after a crash with kasan, this time over warthunder
Thanks, this looks like an attempt to access a fence that was already released, but I can't pinpoint the faulting line in the code for either amdgpu_sync_get_fence or amdgpu_sync_resv. I am using addr2line for this, but the offset into the function shown in the backtrace doesn't make sense, maybe because our builds differ. Can you try it and see if you get the exact offending lines in both functions?
Thanks, Andrey
--- Comment #9 from higuita@gmx.net ---
Created attachment 136400
  --> https://bugs.freedesktop.org/attachment.cgi?id=136400&action=edit
dmesg oops with kasan 2
Another crash, this time in RUST, just to see if it helps in any way.
I know how to build stuff, but I have no idea how to debug the kernel :)
Can you please give me some pointers on how to find the needed info and give it to you?
--- Comment #10 from Andrey Grodzovsky andrey.grodzovsky@amd.com ---
(In reply to higuita from comment #9)
> Created attachment 136400 [details]
> dmesg oops with kasan 2
>
> Another crash, this time in RUST, just to see if it helps in any way
>
> i know how to build stuff, but i have no idea how to debug the kernel :)
>
> can you please give me some pointers how to find and give you the needed info?
NP, check the answer here: https://stackoverflow.com/questions/13468286/how-to-read-understand-analyze-...
To obtain the function address within your amdgpu.ko, just run:
nm -C drivers/gpu/drm/amd/amdgpu/amdgpu.ko | grep amdgpu_sync_get_fence
nm -C drivers/gpu/drm/amd/amdgpu/amdgpu.ko | grep amdgpu_sync_resv
You can read the offset into the function from the dmesg dump: amdgpu_sync_get_fence+0x91/0xe0 means the offset is 0x91 (0xe0 is the length of the function).
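The arithmetic described here (symbol address from nm plus the offset from the backtrace) can be sketched as a small script. This is only an illustration under stated assumptions: the function names `parse_rip`, `symbol_address` and `addr2line_command` are hypothetical, the nm line in `EXAMPLE_NM` is a made-up placeholder rather than output from any real build, and the module is assumed to have been built with debug info so that addr2line can resolve source lines. The script only builds the addr2line command line; it does not run it.

```python
import re


def parse_rip(text):
    """Split 'symbol+0xOFF/0xSIZE [module]' into (symbol, offset, size, module)."""
    match = re.search(r"(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)(?: \[(\w+)\])?", text)
    if match is None:
        raise ValueError("no symbol+offset/size found")
    symbol, offset, size, module = match.groups()
    return symbol, int(offset, 16), int(size, 16), module


def symbol_address(nm_output, symbol):
    """Find `symbol` in nm output lines of the form '<hex addr> <type> <name>'."""
    for line in nm_output.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[2] == symbol:
            return int(parts[0], 16)
    raise KeyError(symbol)


def addr2line_command(module_path, nm_output, rip_text):
    """Build an addr2line invocation for the faulting address."""
    symbol, offset, size, _module = parse_rip(rip_text)
    if offset >= size:
        raise ValueError("offset lies outside the function")
    address = symbol_address(nm_output, symbol) + offset
    # -f prints the function name, -i follows inlined calls, -e names the object file.
    return ["addr2line", "-f", "-i", "-e", module_path, hex(address)]


# Placeholder nm line: the address is invented for this example.
EXAMPLE_NM = "0000000000012340 T amdgpu_sync_get_fence\n"

cmd = addr2line_command(
    "drivers/gpu/drm/amd/amdgpu/amdgpu.ko",
    EXAMPLE_NM,
    "RIP: 0010:amdgpu_sync_get_fence+0x91/0xe0 [amdgpu]",
)
print(" ".join(cmd))
```

With real nm output substituted for `EXAMPLE_NM`, the printed command can be run from the kernel build directory. The result is only meaningful if amdgpu.ko is the exact build that produced the oops, which matches the concern about differing builds raised in comment #8.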
Thanks, Andrey
Martin Peres martin.peres@free.fr changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |MOVED
             Status|NEW                         |RESOLVED
--- Comment #11 from Martin Peres martin.peres@free.fr ---
-- GitLab Migration Automatic Message --
This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.
You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/276.