https://bugs.freedesktop.org/show_bug.cgi?id=93460
Bug ID: 93460
Summary: [amdgpu] Ooops during shutdown - amdgpu_vm_grab_id
Product: DRI
Version: DRI git
Hardware: x86-64 (AMD64)
OS: Linux (All)
Status: NEW
Severity: normal
Priority: medium
Component: DRM/AMDgpu
Assignee: dri-devel@lists.freedesktop.org
Reporter: mike@fireburn.co.uk
This doesn't happen on the powerplay branch but it does happen on Linus's tree 4.4-rc5
As this appears related to the scheduler, I can go back to kernel 4.3 and test that. If it doesn't happen there, I can try to bisect, if you think it's worthwhile.
Mike Lothian mike@fireburn.co.uk changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |mike@fireburn.co.uk
--- Comment #1 from Mike Lothian mike@fireburn.co.uk ---
Created attachment 120605
  --> https://bugs.freedesktop.org/attachment.cgi?id=120605&action=edit
Screenshot of oops
--- Comment #2 from Mike Lothian mike@fireburn.co.uk ---
I tried to bisect between v4.3 and HEAD, but too many other issues (i915, ath10k) got in the way, and remoting in was too difficult when the screen wasn't showing anything.
--- Comment #3 from david1.zhou@amd.com david1.zhou@amd.com ---
0x24e7e is in amdgpu_vm_grab_id (include/linux/fence.h:292).
287      * Returns true if f1 is chronologically later than f2. Both fences must be
288      * from the same context, since a seqno is not re-used across contexts.
289      */
290     static inline bool fence_is_later(struct fence *f1, struct fence *f2)
291     {
292             if (WARN_ON(f1->context != f2->context))
293                     return false;

This should be a normal warning, not a bug.
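
For readers following the listing above, here is a minimal userspace sketch of what that check does. It is not the kernel code: `struct model_fence`, `model_fence_is_later`, and the mismatch counter are hypothetical names invented for illustration, and the wraparound-safe seqno comparison is an assumption of this sketch. A counter stands in for `WARN_ON`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for the kernel's struct fence. */
struct model_fence {
	unsigned int context; /* timeline the fence was created on */
	uint32_t seqno;       /* position on that timeline */
};

/* Counts comparisons made between fences from different contexts. */
static unsigned int model_context_mismatches;

/*
 * Model of the quoted check: a cross-context comparison is counted
 * (standing in for WARN_ON) and answers false. The signed-difference
 * seqno comparison is an assumption of this sketch.
 */
static bool model_fence_is_later(const struct model_fence *f1,
				 const struct model_fence *f2)
{
	if (f1->context != f2->context) {
		model_context_mismatches++;
		return false;
	}
	return (int32_t)(f1->seqno - f2->seqno) > 0;
}

/* Usage example: same-context fences compare, cross-context ones warn. */
static void model_fence_selfcheck(void)
{
	struct model_fence older = { 1, 7 };
	struct model_fence newer = { 1, 10 };
	struct model_fence other = { 2, 99 };
	unsigned int before = model_context_mismatches;

	assert(model_fence_is_later(&newer, &older));
	assert(!model_fence_is_later(&other, &older));
	assert(model_context_mismatches == before + 1);
}
```

In this model a mismatch only ever produces a "not later" answer, so by itself it would not explain a freeze. That fits the later question in the thread of where the two fences come from.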
--- Comment #4 from Mike Lothian mike@fireburn.co.uk ---
When this happens my machine just freezes, the only way to continue is to press and hold the power button but that doesn't cleanly unmount the disks
--- Comment #5 from Christian König deathsimple@vodafone.de ---
(In reply to Mike Lothian from comment #4)
> When this happens my machine just freezes, the only way to continue is to press and hold the power button but that doesn't cleanly unmount the disks
Yeah, that is clearly a bug when the driver unloads.
Probably rather hard to reproduce, we should add a test case which loads and unloads the driver multiple times while there is load.
--- Comment #6 from david1.zhou@amd.com david1.zhou@amd.com ---
(In reply to Christian König from comment #5)
> (In reply to Mike Lothian from comment #4)
> > When this happens my machine just freezes, the only way to continue is to press and hold the power button but that doesn't cleanly unmount the disks
> Yeah, that is clearly a bug when the driver unloads.
> Probably rather hard to reproduce, we should add a test case which loads and unloads the driver multiple times while there is load.
Maybe we should avoid using fences for the VMID and use an LRU list instead.
--- Comment #7 from Christian König deathsimple@vodafone.de ---
(In reply to david1.zhou@amd.com from comment #6)
> Maybe we should avoid using fences for the VMID and use an LRU list instead.
Yeah, thought about that as well. The problem is that we used to have an LRU list and I switched to fences because they had less overhead.
We still need to keep the fences around for synchronization, so I'm not sure if that would really help.
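
To make the LRU idea discussed above concrete, here is a minimal userspace sketch of picking an id by recency rather than by comparing fences. Everything in it is an assumption for illustration: `struct model_vmid_mgr`, `model_vmid_grab`, the pool size, and treating id 0 as reserved are hypothetical and not taken from the amdgpu driver. As noted above, fences would still be needed for synchronization, so this models only the selection step.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NUM_VMIDS 4 /* hypothetical pool size; id 0 treated as reserved */

struct model_vmid_mgr {
	uint64_t last_use[MODEL_NUM_VMIDS]; /* 0 means never handed out */
	uint64_t clock;                     /* incremented on every grab */
};

/*
 * Hand out the least recently used id in 1..MODEL_NUM_VMIDS-1.
 * Recency is a plain counter, so no cross-context fence comparison
 * is involved in choosing the id.
 */
static unsigned int model_vmid_grab(struct model_vmid_mgr *mgr)
{
	unsigned int best = 1;
	unsigned int i;

	for (i = 2; i < MODEL_NUM_VMIDS; i++) {
		if (mgr->last_use[i] < mgr->last_use[best])
			best = i;
	}
	mgr->last_use[best] = ++mgr->clock;
	return best;
}

/* Usage example: ids are handed out in turn, then the oldest is reused. */
static void model_vmid_selfcheck(void)
{
	struct model_vmid_mgr mgr = { { 0 }, 0 };

	assert(model_vmid_grab(&mgr) == 1);
	assert(model_vmid_grab(&mgr) == 2);
	assert(model_vmid_grab(&mgr) == 3);
	assert(model_vmid_grab(&mgr) == 1); /* id 1 is now the oldest */
}
```

A linear scan over a handful of ids is used here for brevity. Whether this would carry less overhead than the fence-based selection is exactly the trade-off raised in the comment above, and the sketch does not settle it.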
The real price question is what is going wrong here?
--- Comment #8 from david1.zhou@amd.com david1.zhou@amd.com ---
(In reply to Christian König from comment #7)
> (In reply to david1.zhou@amd.com from comment #6)
> > Maybe we should avoid using fences for the VMID and use an LRU list instead.
> Yeah, thought about that as well. The problem is that we used to have an LRU list and I switched to fences because they had less overhead.
> We still need to keep the fences around for synchronization, so I'm not sure if that would really help.
> The real price question is what is going wrong here?
Yes, we need to identify why the contexts of the two fences differ, where the two fences come from, what kind of fences they are, and which ring each of them belongs to.
Mike Lothian mike@fireburn.co.uk changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|NEW                         |RESOLVED
--- Comment #9 from Mike Lothian mike@fireburn.co.uk ---
I haven't seen this in a while now.