https://bugzilla.kernel.org/show_bug.cgi?id=194867
Bug ID: 194867
Summary: DRM BUG while initializing cape verde (2nd card)
Product: Drivers
Version: 2.5
Kernel Version: 4.11-rc2
Hardware: x86-64
OS: Linux
Tree: Mainline
Status: NEW
Severity: normal
Priority: P1
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri@kernel-bugs.osdl.org
Reporter: janpieter.sollie@dommel.be
Regression: No
Created attachment 255215
  --> https://bugzilla.kernel.org/attachment.cgi?id=255215&action=edit
zip file with all listed attachments
There seems to be a logic error in how the amdgpu module specifies the memory sizes for TTM on the SI architecture: while the Fiji card boots fine, the Cape Verde card triggers a kernel BUG. dmesg, .config and a proposed patch are in the attachment.

The problem lies in linux/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c: the computed p_size is reduced to 0 when the page shift is too big for the size being shifted. I managed to work around the problem by changing the expression "adev->gds.mem.total_size >> PAGE_SHIFT)" in amdgpu_ttm_init to "(adev->gds.mem.total_size >> PAGE_SHIFT) + 1)", and doing the same for "(adev->gds.gws.total_size" and "adev->gds.oa.total_size", though I am not sure this is the correct solution. The problem is that my SI card is limited in memory (I guess) and the page shift is 12.
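The arithmetic behind the report can be sketched in a few lines of standalone C. This is only an illustration, assuming 4 KiB pages (a page shift of 12); the helper names and the 512-byte example size are hypothetical and are not taken from amdgpu_ttm.c or from the attached patches. The round-up variant is shown for comparison only, not as the fix that was actually applied.

```c
#include <stdint.h>

/* Assumption: 4 KiB pages, i.e. a page shift of 12. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  ((uint64_t)1 << SKETCH_PAGE_SHIFT)

/* Plain shift: truncates, so any size below one page yields 0 pages. */
static uint64_t pages_truncated(uint64_t size_bytes)
{
    return size_bytes >> SKETCH_PAGE_SHIFT;
}

/*
 * The workaround described in the report: shift, then add one.
 * Never returns 0, but over-counts by one page when the size is
 * already an exact multiple of the page size.
 */
static uint64_t pages_plus_one(uint64_t size_bytes)
{
    return (size_bytes >> SKETCH_PAGE_SHIFT) + 1;
}

/*
 * Round-up alternative (hypothetical): add page size minus one before
 * shifting, so partial pages count as one page and exact multiples
 * are not over-counted.
 */
static uint64_t pages_rounded_up(uint64_t size_bytes)
{
    return (size_bytes + SKETCH_PAGE_SIZE - 1) >> SKETCH_PAGE_SHIFT;
}
```

With a hypothetical 512-byte region, pages_truncated(512) is 0, while pages_plus_one(512) and pages_rounded_up(512) are both 1. For exactly 4096 bytes the two non-truncating variants differ: pages_plus_one gives 2 and pages_rounded_up gives 1, which may be part of why the "+ 1" change is not obviously the correct solution.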
Alex Deucher (alexdeucher@gmail.com) changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |alexdeucher@gmail.com
--- Comment #1 from Alex Deucher (alexdeucher@gmail.com) ---
Created attachment 255263
  --> https://bugzilla.kernel.org/attachment.cgi?id=255263&action=edit
patch 1/2
Does this patch set fix the issue?
--- Comment #2 from Alex Deucher (alexdeucher@gmail.com) ---
Created attachment 255265
  --> https://bugzilla.kernel.org/attachment.cgi?id=255265&action=edit
patch 2/2
--- Comment #3 from Janpieter Sollie (janpieter.sollie@dommel.be) ---
We're one step further: see the triplefault.txt output. I set the kernel verbosity to 7 and ran modprobe amdgpu (the module is blacklisted). The original error is gone, but the machine appears to hit a triple fault (I only suspect this, so don't blame me if it doesn't), and because of that it reboots immediately without a panic. Should I file a new bug for that, or can you have a look at it? Note that this does not happen with dpm disabled.
--- Comment #4 from Janpieter Sollie (janpieter.sollie@dommel.be) ---
Created attachment 255283
  --> https://bugzilla.kernel.org/attachment.cgi?id=255283&action=edit
/proc/kmsg output
--- Comment #5 from Michel Dänzer (michel@daenzer.net) ---
Did this work with kernel 4.10 or older?
--- Comment #6 from Janpieter Sollie (janpieter.sollie@dommel.be) ---
No, the output is exactly the same: after the 4 ring tests, it reboots.
--- Comment #7 from Michel Dänzer (michel@daenzer.net) ---
That should be tracked in a separate report then.
Janpieter Sollie (janpieter.sollie@dommel.be) changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |CODE_FIX
--- Comment #8 from Janpieter Sollie (janpieter.sollie@dommel.be) ---
Is there any information besides my kernel config, lsmod, /proc/kmsg and lspci that you need to handle this as a new report? I get annoyed myself by users who come to me saying 'it doesn't work and I know nothing about it', so I'd like to provide every piece of info you need.
dri-devel@lists.freedesktop.org