https://bugzilla.kernel.org/show_bug.cgi?id=196635
Bug ID: 196635
Summary: amdgpu clinfo hangs with SI
Product: Drivers
Version: 2.5
Kernel Version: 4.13-rc4
Hardware: x86-64
OS: Linux
Tree: Mainline
Status: NEW
Severity: normal
Priority: P1
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri@kernel-bugs.osdl.org
Reporter: janpieter.sollie@dommel.be
Regression: No
Created attachment 257863 --> https://bugzilla.kernel.org/attachment.cgi?id=257863&action=edit dmesg output
The clinfo command has stopped working since I tested 4.13-rc4 on my PC. The kernel error message is in the attachment. I'm using the amdgpu-pro libraries (not the kernel driver, really only the libraries) on a dual Opteron with an R9 Nano and an HD 7700 installed. I must say I was very happy that DPM is working now, but the clinfo calls not working is a bit of a bummer :(
--- Comment #1 from Janpieter Sollie (janpieter.sollie@dommel.be) --- Created attachment 257865 --> https://bugzilla.kernel.org/attachment.cgi?id=257865&action=edit lspci output
--- Comment #2 from Janpieter Sollie (janpieter.sollie@dommel.be) --- Created attachment 257867 --> https://bugzilla.kernel.org/attachment.cgi?id=257867&action=edit kernel config
--- Comment #3 from Janpieter Sollie (janpieter.sollie@dommel.be) --- Created attachment 257881 --> https://bugzilla.kernel.org/attachment.cgi?id=257881&action=edit working dmesg
With amdgpu.dpm=0, the device seems to initialize correctly according to dmesg.
--- Comment #4 from Janpieter Sollie (janpieter.sollie@dommel.be) --- Created attachment 257883 --> https://bugzilla.kernel.org/attachment.cgi?id=257883&action=edit working clinfo
clinfo output with amdgpu.dpm=0
Janpieter Sollie (janpieter.sollie@dommel.be) changed:
What     |Removed |Added
----------------------------------------------------------------------------
See Also |        |https://bugzilla.kernel.org/show_bug.cgi?id=194899
--- Comment #5 from Janpieter Sollie (janpieter.sollie@dommel.be) --- The system works with dpm=0. I attached some info about the working system. Please note that I DO NOT use the amdgpu-pro kernel module, only its libraries.
--- Comment #6 from Michel Dänzer (michel@daenzer.net) --- Can you bisect the kernel?
--- Comment #7 from Janpieter Sollie (janpieter.sollie@dommel.be) --- I'm not a kernel developer, but I am willing to help you where I can. What do you need from the bisection?
--- Comment #8 from Michel Dänzer (michel@daenzer.net) --- No need to be a developer, just to compile and test a number of kernel Git commits. Search for "git bisect howto".
--- Comment #9 from Janpieter Sollie (janpieter.sollie@dommel.be) --- I just browsed through a few howtos. It won't be easy to pinpoint the problem: in 4.10, it hit a triple fault and then crashed with dpm enabled. Do you want a bisection from that one (see bug 194899) to the current state, or do I need to do something else?
--- Comment #10 from Michel Dänzer (michel@daenzer.net) --- (In reply to Janpieter Sollie from comment #9)
Hmm, I guess I misread the bug description as meaning it worked properly before. If that's not the case, there's probably no point in bisecting.
Janpieter Sollie (janpieter.sollie@dommel.be) changed:
What       |Removed |Added
----------------------------------------------------------------------------
Status     |NEW     |RESOLVED
Resolution |---     |PATCH_ALREADY_AVAILABLE
--- Comment #11 from Janpieter Sollie (janpieter.sollie@dommel.be) --- The problem SEEMS to be tied to CIK support and the upgrade to rc6: disabling CIK support in my kernel and upgrading to rc6 solved the problem. CIK and SI are probably not cooperating properly yet.
--- Comment #12 from Michel Dänzer (michel@daenzer.net) --- (In reply to Janpieter Sollie from comment #11)
Weird, does rc6 still work with CIK support enabled?
--- Comment #13 from Janpieter Sollie (janpieter.sollie@dommel.be) --- Yes. But I really think the problem is at the application layer: I do not see any errors in dmesg when running clinfo, but when I run the application I'm developing, I see the following errors in dmesg:
[31637.263268] amdgpu 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0014 address=0x00000000ff9f4000 flags=0x0000]
[31637.263379] amdgpu 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0014 address=0x00000000ff9e4000 flags=0x0000]
...
and the application hangs.
The interesting part is this: to make sure the driver does not "accidentally work", I added a Polaris device to the system. amdgpu recognised the Polaris, Fiji and SI devices, but only the SI gives these faults.
Do you know how I can figure out whether this is a kernel, middle-layer or application-layer problem?
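As a possible first step in narrowing this down, the IO_PAGE_FAULT lines quoted in comment #13 can be grouped by PCI device, to confirm which card the faults come from and which addresses are involved. The following is only a minimal Python sketch and is not part of the original thread: the regular expression is modelled solely on the two log lines quoted above, and `group_faults` is a hypothetical helper name. It does not by itself tell kernel, library and application problems apart.

```python
import re
from collections import defaultdict

# Pattern modelled only on the two dmesg lines quoted in comment #13;
# other kernels or IOMMU drivers may format these events differently.
FAULT_RE = re.compile(
    r"amdgpu (?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"AMD-Vi: Event logged \[IO_PAGE_FAULT domain=(?P<domain>0x[0-9a-f]+) "
    r"address=(?P<addr>0x[0-9a-f]+) flags=(?P<flags>0x[0-9a-f]+)\]",
    re.IGNORECASE,
)


def group_faults(log_text):
    """Return {pci_address: [faulting_address, ...]} for each IO_PAGE_FAULT line."""
    faults = defaultdict(list)
    for match in FAULT_RE.finditer(log_text):
        faults[match.group("dev")].append(int(match.group("addr"), 16))
    return dict(faults)


# The two lines from comment #13, used here as sample input.
SAMPLE = """\
[31637.263268] amdgpu 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0014 address=0x00000000ff9f4000 flags=0x0000]
[31637.263379] amdgpu 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0014 address=0x00000000ff9e4000 flags=0x0000]
"""

if __name__ == "__main__":
    for dev, addrs in group_faults(SAMPLE).items():
        print(dev, [hex(a) for a in addrs])
```

In practice the input would be the saved output of dmesg rather than the embedded sample; the PCI addresses it reports can then be compared against the attached lspci output.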
dri-devel@lists.freedesktop.org