Hi folks,

Two weeks ago, when commit 22051d9c4a57 reached my system, errors like the following started to occur at random:

"gnome-shell: page allocation failure: order:4, mode:0x40cc0(GFP_KERNEL|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0"

Symptoms: the screen goes blank, as if entering power saving, and the computer cannot be woken up for a few minutes.
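For readers unfamiliar with the "order" field in that message: it is the base-2 logarithm of the number of physically contiguous pages requested. The snippet below is only a minimal sketch of that arithmetic; the 4 KiB page size is an assumption (the usual x86_64 default) and `order_to_bytes` is a hypothetical helper name, not a kernel function.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, the usual x86_64 default


def order_to_bytes(order: int, page_size: int = PAGE_SIZE) -> int:
    """Return the size in bytes of a page allocation of the given order.

    An order-N request asks for 2**N physically contiguous pages.
    """
    return (1 << order) * page_size


# order:4 from the failure message above: 16 contiguous pages
print(order_to_bytes(4))  # 65536
```

Under that assumption, the failing request was for 64 KiB of contiguous memory, which is harder to satisfy on a fragmented system than a single-page request.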
I ran a bisect, and the first bad commit appears to be 476e955dd679. The full bisect logs are here: https://mega.nz/#F!kgYFxAIb!v1tcHANPy2ns1lh4LQLeIg
I reported my finding to the amd-gfx mailing list, but nobody answered. Until yesterday I thought it was a bug in the amdgpu driver. But yesterday, after the error occurred again, the system hung completely, this time with a different error:
[ 3225.317560] BUG: unable to handle page fault for address: 000000000000c9f4
[ 3225.317562] #PF: supervisor read access in kernel mode
[ 3225.317563] #PF: error_code(0x0000) - not-present page
[ 3225.317565] PGD 0 P4D 0
[ 3225.317567] Oops: 0000 [#1] SMP NOPTI
[ 3225.317571] CPU: 2 PID: 12717 Comm: Xorg Tainted: G W 5.3.0-0.rc2.git4.1.fc31.x86_64 #1
[ 3225.317572] Hardware name: System manufacturer System Product Name/ROG STRIX X470-I GAMING, BIOS 2406 06/21/2019
[ 3225.317625] RIP: 0010:dc_resource_state_copy_construct+0x18/0xf0 [amdgpu]
[ 3225.317627] Code: 00 49 83 c4 01 44 39 e0 7f b5 5b 5d 41 5c 41 5d c3 c3 0f 1f 44 00 00 41 56 ba f8 c9 00 00 41 55 41 54 49 89 f4 55 4c 89 e5 53 <44> 8b ae f4 c9 00 00 48 89 fe 4c 89 e7 e8 16 86 48 f7 49 8d 84 24
[ 3225.317630] RSP: 0018:ffffb439c3e377d0 EFLAGS: 00010246
[ 3225.317631] RAX: ffff9b0ba19a0000 RBX: ffffffffc08380b0 RCX: 0000000000000006
[ 3225.317633] RDX: 000000000000c9f8 RSI: 0000000000000000 RDI: ffff9b0ab7fc0000
[ 3225.317635] RBP: 0000000000000000 R08: 000002eef3c694b7 R09: 0000000000000000
[ 3225.317636] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 3225.317638] R13: ffff9b0bb5381000 R14: ffff9b09acc68598 R15: ffff9b09acc68540
[ 3225.317640] FS: 00007fdde56cbf00(0000) GS:ffff9b0bba400000(0000) knlGS:0000000000000000
[ 3225.317641] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3225.317643] CR2: 000000000000c9f4 CR3: 00000007382ee000 CR4: 00000000003406e0
[ 3225.317644] Call Trace:
[ 3225.317714] amdgpu_dm_atomic_commit_tail.cold+0xad/0xe1 [amdgpu]
[ 3225.317719] ? lockdep_hardirqs_on+0xf0/0x180
[ 3225.317723] ? debug_check_no_obj_freed+0x107/0x1d8
[ 3225.317786] ? dm_determine_update_type_for_commit+0x34c/0x420 [amdgpu]
[ 3225.317850] ? dm_determine_update_type_for_commit+0x34c/0x420 [amdgpu]
[ 3225.317855] ? kfree+0x1b6/0x3b0
[ 3225.317918] ? dm_determine_update_type_for_commit+0x34c/0x420 [amdgpu]
[ 3225.317923] ? __lock_acquire+0x247/0x1910
[ 3225.317928] ? find_held_lock+0x32/0x90
[ 3225.317931] ? mark_held_locks+0x50/0x80
[ 3225.317934] ? _raw_spin_unlock_irq+0x29/0x40
[ 3225.317937] ? lockdep_hardirqs_on+0xf0/0x180
[ 3225.317939] ? _raw_spin_unlock_irq+0x29/0x40
[ 3225.317942] ? wait_for_completion_timeout+0x75/0x190
[ 3225.317954] ? commit_tail+0x3c/0x70 [drm_kms_helper]
[ 3225.317960] commit_tail+0x3c/0x70 [drm_kms_helper]
[ 3225.317968] drm_atomic_helper_commit+0xe3/0x150 [drm_kms_helper]
[ 3225.317975] drm_atomic_helper_disable_plane+0x82/0xb0 [drm_kms_helper]
[ 3225.317994] drm_mode_cursor_universal+0x12c/0x240 [drm]
[ 3225.318011] drm_mode_cursor_common+0xd8/0x230 [drm]
[ 3225.318026] ? drm_mode_setplane+0x1a0/0x1a0 [drm]
[ 3225.318038] drm_mode_cursor_ioctl+0x4d/0x70 [drm]
[ 3225.318049] drm_ioctl_kernel+0xaa/0xf0 [drm]
[ 3225.318061] drm_ioctl+0x208/0x390 [drm]
[ 3225.318075] ? drm_mode_setplane+0x1a0/0x1a0 [drm]
[ 3225.318079] ? lockdep_hardirqs_on+0xf0/0x180
[ 3225.318145] amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[ 3225.318164] do_vfs_ioctl+0x411/0x750
[ 3225.318175] ksys_ioctl+0x5e/0x90
[ 3225.318179] __x64_sys_ioctl+0x16/0x20
[ 3225.318188] do_syscall_64+0x5c/0xb0
[ 3225.318191] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 3225.318194] RIP: 0033:0x7fdde5b4007b
[ 3225.318203] Code: 0f 1e fa 48 8b 05 0d 9e 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d dd 9d 0c 00 f7 d8 64 89 01 48
[ 3225.318209] RSP: 002b:00007ffec481a6d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 3225.318213] RAX: ffffffffffffffda RBX: 00007ffec481a710 RCX: 00007fdde5b4007b
[ 3225.318215] RDX: 00007ffec481a710 RSI: 00000000c01c64a3 RDI: 000000000000000e
[ 3225.318217] RBP: 00000000c01c64a3 R08: 0000000000000080 R09: 0000000000000000
[ 3225.318218] R10: 0000000000000004 R11: 0000000000000246 R12: 00000000000006f1
[ 3225.318220] R13: 000000000000000e R14: 000056201b5b5490 R15: 000056201bbe7820
[ 3225.318225] Modules linked in: macvtap macvlan tap rfcomm xt_CHECKSUM xt_MASQUERADE nf_nat_tftp nf_conntrack_tftp tun bridge stp llc nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_REJECT nf_reject_ipv6 ip6t_rpfilter ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter cmac bnep sunrpc vfat fat snd_hda_codec_realtek edac_mce_amd snd_hda_codec_generic ledtrig_audio kvm_amd rtwpci snd_hda_codec_hdmi rtw88 kvm snd_hda_intel snd_usb_audio snd_hda_codec mac80211 snd_hda_core snd_usbmidi_lib irqbypass snd_rawmidi uvcvideo snd_hwdep snd_seq videobuf2_vmalloc videobuf2_memops btusb videobuf2_v4l2 crct10dif_pclmul snd_seq_device videobuf2_common btrtl crc32_pclmul eeepc_wmi snd_pcm btbcm btintel asus_wmi xpad snd_timer sparse_keymap
[ 3225.318261] videodev ff_memless bluetooth joydev ghash_clmulni_intel cfg80211 video snd mc k10temp wmi_bmof soundcore ecdh_generic sp5100_tco ecc rfkill ccp i2c_piix4 libarc4 gpio_amdpt gpio_generic acpi_cpufreq binfmt_misc ip_tables hid_logitech_hidpp amdgpu amd_iommu_v2 gpu_sched ttm drm_kms_helper drm igb crc32c_intel dca i2c_algo_bit hid_logitech_dj nvme nvme_core wmi pinctrl_amd
[ 3225.318283] CR2: 000000000000c9f4
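One way to read the register dump in the oops: the faulting address in CR2 is tiny, and the marked bytes in the "Code:" line (`<44> 8b ae f4 c9 00 00`) appear to decode to a 32-bit load from offset 0xc9f4 relative to RSI, which is zero here. That pattern suggests a NULL structure pointer being dereferenced at a field offset, rather than a corrupted pointer. The sketch below only checks that arithmetic on values copied from the log; `parse_registers` is a hypothetical helper and the interpretation is an assumption, not something confirmed in this thread.

```python
import re

# Register lines copied from the first (kernel-mode) dump in the oops.
OOPS = """\
RAX: ffff9b0ba19a0000 RBX: ffffffffc08380b0 RCX: 0000000000000006
RDX: 000000000000c9f8 RSI: 0000000000000000 RDI: ffff9b0ab7fc0000
CR2: 000000000000c9f4 CR3: 00000007382ee000 CR4: 00000000003406e0
"""


def parse_registers(text: str) -> dict:
    """Map register names (RAX, RSI, CR2, ...) to their integer values."""
    pairs = re.findall(r"\b([A-Z][A-Z0-9]{1,2}): ([0-9a-f]{16})\b", text)
    return {name: int(value, 16) for name, value in pairs}


regs = parse_registers(OOPS)

# Assumption: RSI is the base register of the faulting load.
offset = regs["CR2"] - regs["RSI"]
base_is_null = regs["RSI"] == 0

print(hex(offset), base_is_null)  # 0xc9f4 True
```

If that reading is right, the crash would be a follow-on effect of a failed allocation whose result was not checked, which fits the earlier page allocation failures.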
Every time I see an "SMP NOPTI" error, I suspect that something has gone wrong in memory management, so I decided to ask for help on the linux-mm mailing list. In any case, for reasons unknown to me, the AMD developers have ignored my report.
Thanks.
-- Best Regards, Mike Gavrilov.
On Mon, 5 Aug 2019 at 08:23, Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> wrote:
> Hi folks,
>
> Two weeks ago, when commit 22051d9c4a57 reached my system, errors like the following started to occur at random:
>
> "gnome-shell: page allocation failure: order:4, mode:0x40cc0(GFP_KERNEL|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0"
>
> Symptoms: the screen goes blank, as if entering power saving, and the computer cannot be woken up for a few minutes.
>
> I ran a bisect, and the first bad commit appears to be 476e955dd679. The full bisect logs are here: https://mega.nz/#F!kgYFxAIb!v1tcHANPy2ns1lh4LQLeIg
>
> I reported my finding to the amd-gfx mailing list, but nobody answered. Until yesterday I thought it was a bug in the amdgpu driver. But yesterday, after the error occurred again, the system hung completely, this time with a different error.
Does it still happen if you disable CONFIG_DRM_AMD_DC_DCN2_0? I'm assuming you don't have a Navi GPU.
I think some struct grew too large in the Navi merge. Hopefully AMD cares; otherwise we will have to disable Navi before the release.
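A rough consistency check of the "struct grew too large" idea: in the oops, RDX holds 0xc9f8, which looks like a size argument for copying the state structure. Assuming that value really is the structure size (an assumption read off the register dump, not confirmed here) and assuming 4 KiB pages, a physically contiguous allocation of that size rounds up to a power-of-two number of pages. `size_to_order` below is a hypothetical helper that mirrors this rounding, not the kernel's own function.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, the usual x86_64 default


def size_to_order(size: int, page_size: int = PAGE_SIZE) -> int:
    """Smallest order N such that 2**N pages can hold `size` bytes."""
    pages = -(-size // page_size)  # ceiling division
    order = 0
    while (1 << order) < pages:
        order += 1
    return order


# 0xc9f8 = 51704 bytes, the value seen in RDX (assumed to be the struct size)
print(size_to_order(0xC9F8))  # 4
```

Under those assumptions the result matches the "order:4" in the page allocation failure, so a structure of roughly 50 KiB would be enough to explain the failing requests.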
I've directed this at the main AMD devs who might be helpful.
Dave.