https://bugzilla.kernel.org/show_bug.cgi?id=207383
--- Comment #77 from Kees Cook (kees@outflux.net) --- (Midair collision... you saw the same about the structure layout as I did. Here's my comment...)
(In reply to mnrzk from comment #30)
I've been looking at this bug for a while now and I'll try to share what I've found about it.
In some conditions, when amdgpu_dm_atomic_commit_tail calls dm_atomic_get_new_state, dm_atomic_get_new_state returns a struct dm_atomic_state* with an garbage context pointer.
It looks like when amdgpu_dm_atomic_commit_tail() walks the private objects list with for_each_new_private_obj_in_state(), it'll return the first object's state when the function pointer tables match. This is a struct dm_atomic_state allocation, which is 16 bytes:
struct drm_private_state { struct drm_atomic_state *state; };
struct dm_atomic_state { struct drm_private_state base; struct dc_state *context; };
If struct dm_atomic_state is being freed early, this would match the behavior seen: before 3202fa62f, .base.state would be overwritten with a freelist pointer. After 3202fa62f, .context will be overwritten.
In looking for all "kfree(.*state" patterns in drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c, I see a few suspicious things, maybe. dm_crtc_destroy_state() and amdgpu_dm_connector_funcs_reset() do an explicit kfree(state) -- should they use dm_atomic_destroy_state() instead? Or nothing at all, since I'd expect "state" to be managed by the drm layer via the .atomic_destroy_state callback?
I've also found that this bug exclusively occurs when commit_work is on the workqueue. After forcing drm_atomic_helper_commit to run all of the commits without adding to the workqueue and running the OS, the issue seems to have disappeared. The system was stable for at least 1.5 hours before I manually shut it down (meanwhile it has usually crashed within 30-45 minutes).
Is this the async call to "commit_work" in drm_atomic_helper_commit()?
There's a big warning in there:
/* * Everything below can be run asynchronously without the need to grab * any modeset locks at all under one condition: It must be guaranteed * that the asynchronous work has either been cancelled (if the driver * supports it, which at least requires that the framebuffers get * cleaned up with drm_atomic_helper_cleanup_planes()) or completed * before the new state gets committed on the software side with * drm_atomic_helper_swap_state(). ...
I'm not sure how to determine if amdgpu_dm.c is doing this correctly?
I can't tell what can interfere with drm_atomic_helper_commit() -- I would guess the race is between that and something else causing a kfree(), but I don't know the APIs here at all...