https://bugzilla.kernel.org/show_bug.cgi?id=207383
Duncan (1i5t5.duncan@cox.net) changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
     Kernel Version|5.7-rc1, 5.7-rc2, 5.7-rc3   |5.7-rc1 - 5.7 - 5.8-rc1+
--- Comment #31 from Duncan (1i5t5.duncan@cox.net) ---
(In reply to mnrzk from comment #30)
> Under some conditions, when amdgpu_dm_atomic_commit_tail calls dm_atomic_get_new_state, dm_atomic_get_new_state returns a struct dm_atomic_state* with a garbage context pointer.
Good! Someone with the bug who can actually read and work with the code, now. That portends well for a fix. =:^)
> I've also found that this bug occurs only when commit_work is on the workqueue. After forcing drm_atomic_helper_commit to run all of the commits directly, without adding them to the workqueue, and running the system that way, the issue seems to have disappeared.
I always see it with the workqueue too, but not being a dev, I simply assumed that was how it worked; I had no idea it could be taken off the workqueue.
> The system was stable for at least 1.5 hours before I manually shut it down (whereas it usually crashes within 30-45 minutes).
You're seeing the crash much faster than I am. I believe my longest uptime before a crash with the telltale trace was something like two and a half days, which has obvious implications for marking a bisect step good, since it's always a gamble whether I simply haven't tested long enough.
> Perhaps there's some sort of race condition occurring after commit_work is queued?
Agreed, FWIW, though you've taken it farther than I could, since I can't work with code much beyond bisecting or modifying an existing patch here or there.