https://bugs.freedesktop.org/show_bug.cgi?id=102500
Bug ID: 102500
Summary: [polaris10][amd-staging-4.12] GPU fault detected, somethimes lockup
Product: DRI
Version: DRI git
Hardware: x86-64 (AMD64)
OS: Linux (All)
Status: NEW
Severity: normal
Priority: medium
Component: DRM/AMDgpu
Assignee: dri-devel@lists.freedesktop.org
Reporter: arek.rusi@gmail.com
Created attachment 133914 --> https://bugs.freedesktop.org/attachment.cgi?id=133914&action=edit dmesg - start gnome 3 session only
Hi,

After today's kernel update, The Witcher 3 hangs. After a restart (with only a GNOME 3 session running), I see many entries like these in the kernel log:

...
amdgpu 0000:01:00.0: GPU fault detected: 146 0x0d023d14
amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x001061A0
amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0903D014
amdgpu 0000:01:00.0: VM fault (0x14, vmid 4) at page 1073568, write from 'SDM1' (0x53444d31) (61)
...

Downgrading the kernel resolves the issue.
I use binary kernels from an unofficial Arch repo:
linux-amd-staging 4.12.0.680862.7be0a528b097 - the bad one
linux-amd-staging-4.12.0.680853.eeb9985d7228 - works great

These kernels are built from Alex's git tree, branch amd-staging-4.12.
OpenGL renderer string: AMD Radeon (TM) RX 470 Graphics (POLARIS10 / DRM 3.18.0 / 4.13.0-rc7-mainline, LLVM 6.0.0)
OpenGL core profile version string: 4.5 (Core Profile) Mesa 17.3.0-devel (git-2d93b462b4)
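The numbers in the VM fault lines quoted in this report are internally consistent, and a small script makes that visible. The following is only a sketch: the bit positions used for VM_CONTEXT1_PROTECTION_FAULT_STATUS are assumptions inferred from the values printed in this log (0x14, vmid 4, client 61, write), not taken from register documentation, and the function names are made up for illustration.

```python
# Sketch: decode the values from the "VM fault" lines in this report.
# The field layout below is an ASSUMPTION inferred from the log itself.

def decode_fault_status(status: int) -> dict:
    """Split the fault status word into the fields the log line prints."""
    return {
        "protections": status & 0xFF,            # printed as 0x14
        "client_id": (status >> 12) & 0xFF,      # printed as (61)
        "is_write": bool((status >> 24) & 0x1),  # printed as "write"
        "vmid": (status >> 25) & 0xF,            # printed as vmid 4
    }


def decode_client_tag(tag: int) -> str:
    """Read the 32-bit client tag as four big-endian ASCII characters."""
    return tag.to_bytes(4, "big").decode("ascii")


fault = decode_fault_status(0x0903D014)
print(fault)
print(decode_client_tag(0x53444D31))  # the tag shown next to 'SDM1'
print(0x001061A0)                     # fault address as a decimal page number
```

Under these assumptions, the "page" in the last log line is simply VM_CONTEXT1_PROTECTION_FAULT_ADDR printed in decimal, and the client name is the tag read as ASCII.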
--- Comment #1 from Alex Deucher alexdeucher@gmail.com --- can you bisect?
--- Comment #2 from Arek Ruśniak arek.rusi@gmail.com --- OK. Sometimes the GNOME session refuses to start, or the machine even crashes on it (the LEDs on my keyboard blink).
1753d85bc82849deeb68cb5d7883207f0acbddc4 is the first bad commit
commit 1753d85bc82849deeb68cb5d7883207f0acbddc4
Author: Christian König <christian.koenig@amd.com>
Date: Tue Aug 29 16:14:32 2017 +0200

    drm/amdgpu: bump version for support of local BOs

    Signed-off-by: Christian König <christian.koenig@amd.com>
    Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>

:040000 040000 fb4af6a5aa54bac7afddeb83db09105ce7dab3e5 b30f9c48abef6ddd2bd3a34dd447c0543ca1b29e M drivers
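For readers unfamiliar with how a "first bad commit" result like the one above is reached, here is a minimal sketch of the binary search idea behind bisecting. It assumes a linear history with a known-good first commit and a known-bad last commit; real git histories are graphs, and the commit names and the `is_bad` test below are hypothetical.

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first bad commit.

    Assumes commits[0] is good, commits[-1] is bad, and every commit
    after the first bad one is bad as well.
    """
    lo, hi = 0, len(commits) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # the regression is at mid or earlier
        else:
            lo = mid  # the regression is after mid
    return commits[hi]


# Hypothetical history c0..c9 where the fault first appears in c6.
history = [f"c{i}" for i in range(10)]
culprit = first_bad(history, lambda c: int(c[1:]) >= 6)
print(culprit)
```

Each step halves the range of candidates, so only a handful of kernel builds are needed even for a long range of commits.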
--- Comment #3 from Arek Ruśniak arek.rusi@gmail.com --- Created attachment 133923 --> https://bugs.freedesktop.org/attachment.cgi?id=133923&action=edit dmesg for first bad commit
Autostart through GDM into gnome3-session failed, so I booted into multi-user mode (no VM faults yet) and then started Fluxbox from a TTY.
--- Comment #4 from Dieter Nützel Dieter@nuetzel-hh.de --- (In reply to Alex Deucher from comment #1)
can you bisect?
Hello all,
I get the same on 'amd-staging-drm-next' since the update of 1 Sep (kernel build time: 1 Sep 02:14 CEST), too. I will bisect in the evening.

[ 262.462941] amdgpu 0000:01:00.0: GPU fault detected: 146 0x0a023d14
[ 262.462946] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00101540
[ 262.462949] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0903D014
[ 262.462952] amdgpu 0000:01:00.0: VM fault (0x14, vmid 4) at page 1054016, write from 'SDM1' (0x53444d31) (61)
Dieter Nützel Dieter@nuetzel-hh.de changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |deathsimple@vodafone.de
--- Comment #5 from Dieter Nützel Dieter@nuetzel-hh.de --- (In reply to Dieter Nützel from comment #4)
Yes,

git revert fd8bf087dffc

commit fd8bf087dffc0bce047c5aea2afcb8f821e48db1
Author: Christian König <christian.koenig@amd.com>
Date: Tue Aug 29 16:14:32 2017 +0200

    drm/amdgpu: bump version for support of local BOs

    Signed-off-by: Christian König <christian.koenig@amd.com>
    Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

solves it on 'amd-staging-drm-next', too.
--- Comment #6 from Arek Ruśniak arek.rusi@gmail.com --- Today's build is fine. There are no VM faults anymore. Thanks for the fix. Dieter, could you confirm that for the staging-next tree?
--- Comment #7 from Christian König deathsimple@vodafone.de --- Does patch "drm/amdgpu: fix moved list handling in the VM" fix the issue?
--- Comment #8 from Arek Ruśniak arek.rusi@gmail.com --- Either something went wrong in my build environment or I just got lucky with the earlier test. There is no fix: "GPU fault detected" still happens.

Sorry for the inconvenience.
--- Comment #9 from Vedran Miletić vedran@miletic.net --- (In reply to Dieter Nützel from comment #5)
Confirmed fixed by reverting on Vega 10.
Vedran Miletić vedran@miletic.net changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |vedran@miletic.net
            Summary|[polaris10][amd-staging-4.1 |[polaris10,
                   |2] GPU fault detected,      |vega10][amd-staging-4.12,
                   |somethimes lockup           |amd-staging-drm-next] GPU
                   |                            |fault detected, somethimes
                   |                            |lockup
--- Comment #10 from Arek Ruśniak arek.rusi@gmail.com --- Additional info: I tried to figure out why everything went OK in my earlier test, and Mesa is probably the trigger:

Linux-amd-staging + Mesa-git + LLVM-svn - failure
Linux-amd-staging + Mesa-git + LLVM 4.0.1 - failure
Linux-amd-staging + Mesa 17.1.8 + LLVM 4.0.1 - works OK

I will try some bisecting later; we will see.
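The reasoning behind "Mesa is probably the trigger" can be written down as a small check over the three test combinations listed in comment #10. This is only an illustrative sketch; the function name and data layout are invented here, and three data points are suggestive rather than conclusive.

```python
from collections import defaultdict


def consistent_predictors(results):
    """Return components whose value alone determines the outcome.

    A component qualifies if each of its values always gives the same
    result and the result differs between at least two of its values.
    """
    predictors = []
    for name in results[0][0].keys():
        outcomes = defaultdict(set)
        for config, works in results:
            outcomes[config[name]].add(works)
        each_value_consistent = all(len(seen) == 1 for seen in outcomes.values())
        outcome_varies = len(set().union(*outcomes.values())) > 1
        if each_value_consistent and outcome_varies:
            predictors.append(name)
    return predictors


# The three combinations from comment #10 (True means "works OK").
tests = [
    ({"kernel": "Linux-amd-staging", "mesa": "Mesa-git", "llvm": "LLVM-svn"}, False),
    ({"kernel": "Linux-amd-staging", "mesa": "Mesa-git", "llvm": "LLVM 4.0.1"}, False),
    ({"kernel": "Linux-amd-staging", "mesa": "Mesa 17.1.8", "llvm": "LLVM 4.0.1"}, True),
]
print(consistent_predictors(tests))
```

LLVM 4.0.1 appears in both a failing and a working combination, so it cannot explain the difference on its own, while the Mesa version lines up with the outcome in every row.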
--- Comment #11 from Arek Ruśniak arek.rusi@gmail.com --- On the Mesa side, this looks like the culprit:

214b565bc28bc4419f3eec29ab7bbe34080459fe is the first bad commit
commit 214b565bc28bc4419f3eec29ab7bbe34080459fe
Author: Christian König <christian.koenig@amd.com>
Date: Tue Aug 29 16:45:46 2017 +0200
winsys/amdgpu: set AMDGPU_GEM_CREATE_VM_ALWAYS_VALID if possible v2
When the kernel supports it set the local flag and stop adding those BOs to the BO list.
Can probably be optimized much more.
v2: rename new flag to AMDGPU_GEM_CREATE_VM_ALWAYS_VALID
Reviewed-by: Marek Olšák marek.olsak@amd.com
:040000 040000 2e4b2737f37ede2bbdbbe6815fe0fa562177c2b7 3482c86ed92116adff7ab12b2d4de870746a1df6 M src
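The two bisect results in this thread fit together if userspace only enables the new path when the kernel advertises support for it. The sketch below illustrates that kind of version gating; it is an interpretation, not Mesa's actual code. The threshold `LOCAL_BO_MIN_VERSION`, the `DrmVersion` type, and `plan_buffer` are hypothetical names, and the real version number is not stated in this thread.

```python
from typing import List, NamedTuple, Tuple


class DrmVersion(NamedTuple):
    major: int
    minor: int


# HYPOTHETICAL threshold at which the kernel advertises local-BO support.
LOCAL_BO_MIN_VERSION = DrmVersion(3, 20)


def plan_buffer(kernel: DrmVersion, wants_local: bool) -> Tuple[List[str], bool]:
    """Decide the creation flags and whether the BO goes on the BO list."""
    use_local = wants_local and kernel >= LOCAL_BO_MIN_VERSION
    flags = ["AMDGPU_GEM_CREATE_VM_ALWAYS_VALID"] if use_local else []
    add_to_bo_list = not use_local  # always-valid BOs are left off the list
    return flags, add_to_bo_list


print(plan_buffer(DrmVersion(3, 18), wants_local=True))
print(plan_buffer(DrmVersion(3, 20), wants_local=True))
```

Under this reading, reverting the kernel version bump would hide the feature from userspace, so the new code path is never taken and the fault does not show up, even though the underlying kernel bug is still there.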
--- Comment #12 from Christian König deathsimple@vodafone.de --- To repeat my question: Does patch "drm/amdgpu: fix moved list handling in the VM" fix the issue?
Do you guys have this in your kernel branch yet? If not that lockup is expected.
--- Comment #13 from Arek Ruśniak arek.rusi@gmail.com --- Christian, sorry, I thought that was clear. Yes, I updated ASAP, so my kernel contains: https://cgit.freedesktop.org/~agd5f/linux/commit/?h=amd-staging-4.12&id=... It doesn't help with the VM faults.
Every test right before and after your comment is for: linux-amd-staging-4.12-c5def4cbdb61
--- Comment #14 from Christian König deathsimple@vodafone.de --- (In reply to Arek Ruśniak from comment #13)
Christian sorry, I thought that was clear.
No problem, that just means that this is the same issue I'm still hunting for.
--- Comment #15 from Christian König deathsimple@vodafone.de --- Created attachment 134082 --> https://bugs.freedesktop.org/attachment.cgi?id=134082&action=edit Possible fix
Please try the attached kernel patch.
--- Comment #16 from Arek Ruśniak arek.rusi@gmail.com --- Patch fixes issue. I've tried both staging-4.12 and staging-drm-next branches. Thanks Christian
PS. It would be nice if Vedran could confirm this for Vega before we close.
--- Comment #17 from Dieter Nützel Dieter@nuetzel-hh.de --- (In reply to Christian König from comment #15)
Created attachment 134082 [details] [review] Possible fix
Please try the attached kernel patch.
Hello Christian,
you've done your 'homework'...;-)
To repeat my question: Does patch "drm/amdgpu: fix moved list handling in the VM" fix the issue?
Do you guys have this in your kernel branch yet? If not that lockup is expected.
No, I haven't. It fell through the cracks of the repeated DC rebases of Alex's 'amd-staging-drm-next' tree (I didn't notice it for the last 7 days, during Alex's vacation). I'll keep it short: NO, that didn't solve it for me either.
But _this_ patch is GOLD: drm-amdgpu-fix-VM-sync-with-always-valid-BOs.mbox
Tested-by: Dieter Nützel Dieter@nuetzel-hh.de
Best 'glmark2' score I've ever seen.
RX580, 8 GB
Xeon X3470, 4/8, 3 GHz
24 GB
glmark2 Score: 6428
With additional load on the gfx cores from running 'opencl-example/run_tests.sh' in parallel, I got
glmark2 Score: 7574
Good job!
--- Comment #18 from Vedran Miletić vedran@miletic.net --- (In reply to Arek Ruśniak from comment #16)
Patch fixes issue. I've tried both staging-4.12 and staging-drm-next branches. Thanks Christian
PS. It will be nice if Vedran could confirmed this for Vega before we close.
I can confirm that after applying the patch the issue doesn't occur for me. (I hope that's enough, I can't claim more than that since I have done 2-3 upgrades of mesa/llvm since I last tested the broken kernel.)
--- Comment #19 from charlie bug0xa3d2@hushmail.com --- Bug 102500 might be related to bug 102598.
I tried to apply the patch (attachment 134082) to the amd-staging-4.12 (~agd5f/linux) kernel and to drm-next-4.15-wip, but it does not apply cleanly. I applied it manually to drm-next-4.15-wip, and that kernel would not finish compiling.
--- Comment #20 from charlie bug0xa3d2@hushmail.com --- I confirm that bug 102500 and bug 102598 are the same.
I split up the patch into 3 parts and they applied cleanly with offsets to drm-next-4.15-wip.
I then reverted Mesa to commit 214b565bc28bc4419f3eec29ab7bbe34080459fe (winsys/amdgpu: set AMDGPU_GEM_CREATE_VM_ALWAYS_VALID if possible v2), compiled it, and started X; the corruption and lockups are gone.
charlie bug0xa3d2@hushmail.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bug0xa3d2@hushmail.com
--- Comment #21 from charlie bug0xa3d2@hushmail.com --- *** Bug 102598 has been marked as a duplicate of this bug. ***
--- Comment #22 from Vedran Miletić vedran@miletic.net --- The patch has been included in amd-staging-drm-next for a while. Should this bug be closed?
Arek Ruśniak arek.rusi@gmail.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED