[Bug 207383] [Regression] 5.7 amdgpu/polaris11 gpf: amdgpu_atomic_commit_tail

21 Jul 2020


      https://bugzilla.kernel.org/show_bug.cgi?id=207383
--- Comment #77 from Kees Cook (kees@outflux.net) ---
(Midair collision... you saw the same about the structure layout as I did.
Here's my comment...)
(In reply to mnrzk from comment #30)
...
I've been looking at this bug for a while now and I'll try to share what
I've found about it.
In some conditions, when amdgpu_dm_atomic_commit_tail calls
dm_atomic_get_new_state, dm_atomic_get_new_state returns a struct
dm_atomic_state* with an garbage context pointer.
It looks like when amdgpu_dm_atomic_commit_tail() walks the private objects
list with for_each_new_private_obj_in_state(), it'll return the first object's
state when the function pointer tables match. This is a struct dm_atomic_state
allocation, which is 16 bytes:
struct drm_private_state {
        struct drm_atomic_state *state;
};
struct dm_atomic_state {
        struct drm_private_state base;
        struct dc_state *context;
};
If struct dm_atomic_state is being freed early, this would match the behavior
seen: before 3202fa62f, .base.state would be overwritten with a freelist
pointer. After 3202fa62f, .context will be overwritten.
In looking for all "kfree(.*state" patterns in
drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c, I see a few suspicious
things, maybe. dm_crtc_destroy_state() and amdgpu_dm_connector_funcs_reset() do
an explicit kfree(state) -- should they use dm_atomic_destroy_state() instead?
Or nothing at all, since I'd expect "state" to be managed by the drm layer via
the .atomic_destroy_state callback?
...
I've also found that this bug exclusively occurs when commit_work is on the
workqueue. After forcing drm_atomic_helper_commit to run all of the commits
without adding to the workqueue and running the OS, the issue seems to have
disappeared. The system was stable for at least 1.5 hours before I manually
shut it down (meanwhile it has usually crashed within 30-45 minutes).
Is this the async call to "commit_work" in drm_atomic_helper_commit()?
There's a big warning in there:
/*
         * Everything below can be run asynchronously without the need to grab
         * any modeset locks at all under one condition: It must be guaranteed
         * that the asynchronous work has either been cancelled (if the driver
         * supports it, which at least requires that the framebuffers get
         * cleaned up with drm_atomic_helper_cleanup_planes()) or completed
         * before the new state gets committed on the software side with
         * drm_atomic_helper_swap_state().
         ...
I'm not sure how to determine if amdgpu_dm.c is doing this correctly?
I can't tell what can interfere with drm_atomic_helper_commit() -- I would
guess the race is between that and something else causing a kfree(), but I
don't know the APIs here at all...
-- 
You are receiving this mail because:
You are watching the assignee of the bug.

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

[Bug 207383] [Regression] 5.7 amdgpu/polaris11 gpf: amdgpu_atomic_commit_tail