Dear Kees,
On 24.07.20 at 00:32, Kees Cook wrote:
On Thu, Jul 23, 2020 at 09:10:15PM +0000, Mazin Rezk wrote:
When amdgpu_dm_atomic_commit_tail is running in the workqueue, drm_atomic_state_put will get called while amdgpu_dm_atomic_commit_tail is running, causing a race condition where state (and then dm_state) is sometimes freed while amdgpu_dm_atomic_commit_tail is running. This bug has occurred since 5.7-rc1 and is well documented among polaris11 users [1].
Prior to 5.7, this was not a noticeable issue since the freelist pointer was stored at the beginning of dm_state (base), which was unused. After the freelist pointer was moved to the middle of the struct, it overwrote the context, turning dc_state into garbage data and making the call to dm_enable_per_frame_crtc_master_sync dereference a freelist pointer.
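To illustrate why the freelist pointer's position matters, here is a minimal sketch, not the actual SLUB or amdgpu code. The struct demo_obj and the helpers demo_write_freelist and demo_middle_offset are hypothetical names, and the two-pointer layout is only an assumption loosely modelled on dm_state. It mimics an allocator writing its freelist pointer into an already-freed object at a chosen offset.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical two-pointer layout; only loosely modelled on dm_state. */
struct demo_obj {
	void *base;     /* first field; a stale reader in this sketch never looks at it */
	void *context;  /* the pointer a stale reader would dereference */
};

/*
 * Mimic an allocator storing its "next free object" pointer inside a
 * freed object, at the given byte offset.
 */
static void demo_write_freelist(struct demo_obj *obj, size_t offset,
				void *next_free)
{
	memcpy((unsigned char *)obj + offset, &next_free, sizeof(next_free));
}

/* Middle of the object, rounded down to pointer alignment (assumed policy). */
static size_t demo_middle_offset(void)
{
	return (sizeof(struct demo_obj) / 2) & ~(sizeof(void *) - 1);
}
```

With the pointer written at offset 0, only base is clobbered and context still holds its old value, so a use-after-free can go unnoticed. With the pointer written at demo_middle_offset(), context itself is replaced by the freelist pointer, which matches the symptom described above.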
This patch fixes the aforementioned issue by calling drm_atomic_state_get in amdgpu_dm_atomic_commit before drm_atomic_helper_commit is called and drm_atomic_state_put after amdgpu_dm_atomic_commit_tail is complete.
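The get/put pairing described here can be sketched in plain C as follows. This is an illustration under stated assumptions, not the patch itself: demo_state, demo_commit and demo_commit_tail are hypothetical stand-ins for drm_atomic_state, amdgpu_dm_atomic_commit and amdgpu_dm_atomic_commit_tail, and a plain int replaces the kernel's atomic kref.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical refcounted state; not the real drm_atomic_state. */
struct demo_state {
	int refcount;  /* the kernel uses an atomic kref; plain int for the sketch */
	int context;   /* stand-in for the data the deferred tail reads */
};

static int demo_frees;  /* number of states released so far */

static struct demo_state *demo_state_alloc(int context)
{
	struct demo_state *s = malloc(sizeof(*s));

	if (!s)
		return NULL;
	s->refcount = 1;  /* the caller's reference */
	s->context = context;
	return s;
}

static void demo_state_get(struct demo_state *s)
{
	assert(s->refcount > 0);
	s->refcount++;
}

static void demo_state_put(struct demo_state *s)
{
	assert(s->refcount > 0);
	if (--s->refcount == 0) {
		demo_frees++;
		free(s);
	}
}

/* Commit: take a reference on behalf of the deferred tail before queueing. */
static struct demo_state *demo_commit(struct demo_state *s)
{
	demo_state_get(s);
	return s;  /* pointer handed to the (simulated) workqueue */
}

/* Deferred tail: use the state, then drop the reference taken in commit. */
static int demo_commit_tail(struct demo_state *s)
{
	int seen = s->context;

	demo_state_put(s);
	return seen;
}
```

Because demo_commit holds its own reference, the caller may drop its reference before the tail runs, and the state is freed only after demo_commit_tail has finished with it.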
According to my testing on 5.8.0-rc6, this should fix bug 207383 on Bugzilla [1].
Nice work tracking this down!
Fixes: 3202fa62f ("slub: relocate freelist pointer to middle of object")
I do, however, object to this Fixes tag. :) The flaw appears to have been with amdgpu_dm's reference tracking of "state" in the nonblocking case. (How this reference counting is supposed to work correctly, though, I'm not sure.) If I look at where the drm helper was split from being the default callback, it looks like this was what introduced the bug:
da5c47f682ab ("drm/amd/display: Remove acrtc->stream")
? 3202fa62f certainly exposed it much more quickly, but there was a race even without 3202fa62f where something could have realloced the memory and written over it.
As I understand it, the Fixes tag is mainly an aid for backporting commits.
As Linux 5.8-rc7 is going to be released this Sunday, I wonder whether commit 3202fa62f ("slub: relocate freelist pointer to middle of object") should be reverted for now, to fix the regression for users in line with Linux's no-regression policy. Once the AMDGPU/DRM driver issue is fixed, it can be reapplied. I know it's not optimal, but as the fix will need some testing, I'd argue a revert is the best option for users.
Kind regards,
Paul