Hi all,
Still very much early stuff, still very much looking for initial thoughts and maybe some ideas how this could all be rolled out across drivers.
Full intro probably best from the RFC cover letter:
https://lore.kernel.org/amd-gfx/20200512085944.222637-1-daniel.vetter@ffwll....
Changes since last time around:
- might_sleep annotation has landed already, I split that out as a stand-alone
- now with an mm patch to improve direct reclaim annotations for mmu notifiers. This allows us to very easily catch issues in that area - no more need for exhaustive testing and luck to make sure we're not leaving a GFP_NOFS or GFP_NOIO around which should be a GFP_ATOMIC
- kerneldoc that explains all the reasoning behind the annotations and priming, hopefully
Driver patches are still largely just meant as examples to illustrate usage, but from various irc chats I think discussing them is really useful to gain clarity on the exact places the annotations should be put.
Cheers, Daniel
Daniel Vetter (18):
  mm: Track mmu notifiers in fs_reclaim_acquire/release
  dma-buf: minor doc touch-ups
  dma-fence: basic lockdep annotations
  dma-fence: prime lockdep annotations
  drm/vkms: Annotate vblank timer
  drm/vblank: Annotate with dma-fence signalling section
  drm/atomic-helper: Add dma-fence annotations
  drm/amdgpu: add dma-fence annotations to atomic commit path
  drm/scheduler: use dma-fence annotations in main thread
  drm/amdgpu: use dma-fence annotations in cs_submit()
  drm/amdgpu: s/GFP_KERNEL/GFP_ATOMIC in scheduler code
  drm/amdgpu: DC also loves to allocate stuff where it shouldn't
  drm/amdgpu/dc: Stop dma_resv_lock inversion in commit_tail
  drm/scheduler: use dma-fence annotations in tdr work
  drm/amdgpu: use dma-fence annotations for gpu reset code
  Revert "drm/amdgpu: add fbdev suspend/resume on gpu reset"
  drm/amdgpu: gpu recovery does full modesets
  drm/i915: Annotate dma_fence_work
 Documentation/driver-api/dma-buf.rst          |  18 +-
 drivers/dma-buf/dma-buf.c                     |   6 +-
 drivers/dma-buf/dma-fence.c                   | 202 ++++++++++++++++++
 drivers/dma-buf/dma-resv.c                    |   4 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        |   5 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    |  22 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c     |   2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c       |   2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c      |   2 +-
 drivers/gpu/drm/amd/amdgpu/atom.c             |   2 +-
 .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c |  18 +-
 drivers/gpu/drm/amd/display/dc/core/dc.c      |   4 +-
 drivers/gpu/drm/drm_atomic_helper.c           |  16 ++
 drivers/gpu/drm/drm_vblank.c                  |   8 +-
 drivers/gpu/drm/i915/i915_sw_fence_work.c     |   3 +
 drivers/gpu/drm/scheduler/sched_main.c        |  11 +
 drivers/gpu/drm/vkms/vkms_crtc.c              |   8 +-
 include/linux/dma-fence.h                     |  13 ++
 mm/mmu_notifier.c                             |   7 -
 mm/page_alloc.c                               |  23 +-
 20 files changed, 341 insertions(+), 35 deletions(-)
fs_reclaim_acquire/release nicely catch recursion issues when allocating GFP_KERNEL memory against shrinkers (which gpu drivers tend to use to keep the excessive caches in check). For mmu notifier recursions we do have lockdep annotations since 23b68395c7c7 ("mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end").
But these only fire if a path actually results in some pte invalidation - for most small allocations that's very rarely the case. The other trouble is that pte invalidation can happen any time when __GFP_RECLAIM is set. Which means only really GFP_ATOMIC is a safe choice, GFP_NOIO isn't good enough to avoid potential mmu notifier recursion.
I was pondering whether we should just do the general annotation, but there's always the risk of false positives. Plus I'm assuming that the core fs and io code is a lot better reviewed and tested than random mmu notifier code in drivers. Hence why I decided to only annotate for that specific case.
Furthermore even if we'd create a lockdep map for direct reclaim, we'd still need to explicitly pull in the mmu notifier map - there are a lot more places that do pte invalidation than just direct reclaim, these two contexts aren't the same.
Note that the mmu notifiers needing their own independent lockdep map is also the reason we can't hold them from fs_reclaim_acquire to fs_reclaim_release - it would nest with the acquisition in the pte invalidation code, causing a lockdep splat. And we can't remove the annotations from pte invalidation and all the other places, since they're called from many more places than just page reclaim. Hence we can only do the equivalent of might_lock, but on the raw lockdep map.
With this we can also remove the lockdep priming added in 66204f1d2d1b ("mm/mmu_notifiers: prime lockdep") since the new annotations are strictly more powerful.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: linux-mm@kvack.org
Cc: linux-rdma@vger.kernel.org
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
This is part of a gpu lockdep annotation series simply because it really helps to catch issues where gpu subsystem locks and primitives can deadlock with themselves through allocations and mmu notifiers. But aside from that motivation it should be completely free-standing, and can land through -mm/-rdma/-hmm or any other tree really whenever.
-Daniel
---
 mm/mmu_notifier.c |  7 -------
 mm/page_alloc.c   | 23 ++++++++++++++---------
 2 files changed, 14 insertions(+), 16 deletions(-)
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 06852b896fa6..5d578b9122f8 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -612,13 +612,6 @@ int __mmu_notifier_register(struct mmu_notifier *subscription,
 	lockdep_assert_held_write(&mm->mmap_sem);
 	BUG_ON(atomic_read(&mm->mm_users) <= 0);
 
-	if (IS_ENABLED(CONFIG_LOCKDEP)) {
-		fs_reclaim_acquire(GFP_KERNEL);
-		lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
-		lock_map_release(&__mmu_notifier_invalidate_range_start_map);
-		fs_reclaim_release(GFP_KERNEL);
-	}
-
 	if (!mm->notifier_subscriptions) {
 		/*
 		 * kmalloc cannot be called under mm_take_all_locks(), but we
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 13cc653122b7..f8a222db4a53 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -57,6 +57,7 @@
 #include <trace/events/oom.h>
 #include <linux/prefetch.h>
 #include <linux/mm_inline.h>
+#include <linux/mmu_notifier.h>
 #include <linux/migrate.h>
 #include <linux/hugetlb.h>
 #include <linux/sched/rt.h>
@@ -4124,7 +4125,7 @@ should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_fla
 static struct lockdep_map __fs_reclaim_map =
 	STATIC_LOCKDEP_MAP_INIT("fs_reclaim", &__fs_reclaim_map);
 
-static bool __need_fs_reclaim(gfp_t gfp_mask)
+static bool __need_reclaim(gfp_t gfp_mask)
 {
 	gfp_mask = current_gfp_context(gfp_mask);
 
@@ -4136,10 +4137,6 @@ static bool __need_fs_reclaim(gfp_t gfp_mask)
 	if (current->flags & PF_MEMALLOC)
 		return false;
 
-	/* We're only interested __GFP_FS allocations for now */
-	if (!(gfp_mask & __GFP_FS))
-		return false;
-
 	if (gfp_mask & __GFP_NOLOCKDEP)
 		return false;
 
@@ -4158,15 +4155,23 @@ void __fs_reclaim_release(void)
 
 void fs_reclaim_acquire(gfp_t gfp_mask)
 {
-	if (__need_fs_reclaim(gfp_mask))
-		__fs_reclaim_acquire();
+	if (__need_reclaim(gfp_mask)) {
+		if (!(gfp_mask & __GFP_FS))
+			__fs_reclaim_acquire();
+
+		lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
+		lock_map_release(&__mmu_notifier_invalidate_range_start_map);
+
+	}
 }
 EXPORT_SYMBOL_GPL(fs_reclaim_acquire);
 
 void fs_reclaim_release(gfp_t gfp_mask)
 {
-	if (__need_fs_reclaim(gfp_mask))
-		__fs_reclaim_release();
+	if (__need_reclaim(gfp_mask)) {
+		if (!(gfp_mask & __GFP_FS))
+			__fs_reclaim_release();
+	}
 }
 EXPORT_SYMBOL_GPL(fs_reclaim_release);
 #endif
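To make the failure mode the commit message describes concrete: the pattern being caught is an allocation that can enter direct reclaim while holding a lock that the driver's own shrinker (or mmu notifier callback) also takes. A minimal, hypothetical driver sketch - my_dev_lock, my_shrinker_scan and my_submit are made-up names, not from this series:

/* Hypothetical driver code, purely to illustrate the deadlock pattern. */
#include <linux/mutex.h>
#include <linux/shrinker.h>
#include <linux/slab.h>

static DEFINE_MUTEX(my_dev_lock);

static unsigned long my_shrinker_scan(struct shrinker *s,
				      struct shrink_control *sc)
{
	mutex_lock(&my_dev_lock);	/* reclaim path takes the driver lock */
	/* ... evict cached objects ... */
	mutex_unlock(&my_dev_lock);
	return 0;
}

static void my_submit(void)
{
	void *p;

	mutex_lock(&my_dev_lock);
	/*
	 * GFP_KERNEL may enter direct reclaim, which may call
	 * my_shrinker_scan() and try to take my_dev_lock again.
	 * fs_reclaim_acquire() records this dependency on every
	 * allocation, so lockdep complains even when reclaim never
	 * actually runs here.
	 */
	p = kmalloc(64, GFP_KERNEL);
	kfree(p);
	mutex_unlock(&my_dev_lock);
}

The mmu notifier analogue of the same cycle is what motivates annotating all reclaim-capable allocations, not just the __GFP_FS ones.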
Hi, Daniel,
Please see below.
On 6/4/20 10:12 AM, Daniel Vetter wrote:
fs_reclaim_acquire/release nicely catch recursion issues when allocating GFP_KERNEL memory against shrinkers (which gpu drivers tend to use to keep the excessive caches in check). For mmu notifier recursions we do have lockdep annotations since 23b68395c7c7 ("mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end").
But these only fire if a path actually results in some pte invalidation - for most small allocations that's very rarely the case. The other trouble is that pte invalidation can happen any time when __GFP_RECLAIM is set. Which means only really GFP_ATOMIC is a safe choice, GFP_NOIO isn't good enough to avoid potential mmu notifier recursion.
I was pondering whether we should just do the general annotation, but there's always the risk of false positives. Plus I'm assuming that the core fs and io code is a lot better reviewed and tested than random mmu notifier code in drivers. Hence why I decided to only annotate for that specific case.
Furthermore even if we'd create a lockdep map for direct reclaim, we'd still need to explicitly pull in the mmu notifier map - there are a lot more places that do pte invalidation than just direct reclaim, these two contexts aren't the same.
Note that the mmu notifiers needing their own independent lockdep map is also the reason we can't hold them from fs_reclaim_acquire to fs_reclaim_release - it would nest with the acquisition in the pte invalidation code, causing a lockdep splat. And we can't remove the annotations from pte invalidation and all the other places, since they're called from many more places than just page reclaim. Hence we can only do the equivalent of might_lock, but on the raw lockdep map.
With this we can also remove the lockdep priming added in 66204f1d2d1b ("mm/mmu_notifiers: prime lockdep") since the new annotations are strictly more powerful.
Cc: Andrew Morton akpm@linux-foundation.org Cc: Jason Gunthorpe jgg@mellanox.com Cc: linux-mm@kvack.org Cc: linux-rdma@vger.kernel.org Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com
This is part of a gpu lockdep annotation series simply because it really helps to catch issues where gpu subsystem locks and primitives can deadlock with themselves through allocations and mmu notifiers. But aside from that motivation it should be completely free-standing, and can land through -mm/-rdma/-hmm or any other tree really whenever.
-Daniel
mm/mmu_notifier.c | 7 ------- mm/page_alloc.c | 23 ++++++++++++++--------- 2 files changed, 14 insertions(+), 16 deletions(-)
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c index 06852b896fa6..5d578b9122f8 100644 --- a/mm/mmu_notifier.c +++ b/mm/mmu_notifier.c @@ -612,13 +612,6 @@ int __mmu_notifier_register(struct mmu_notifier *subscription, lockdep_assert_held_write(&mm->mmap_sem); BUG_ON(atomic_read(&mm->mm_users) <= 0);
- if (IS_ENABLED(CONFIG_LOCKDEP)) {
fs_reclaim_acquire(GFP_KERNEL);
lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
lock_map_release(&__mmu_notifier_invalidate_range_start_map);
fs_reclaim_release(GFP_KERNEL);
- }
- if (!mm->notifier_subscriptions) { /*
- kmalloc cannot be called under mm_take_all_locks(), but we
diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 13cc653122b7..f8a222db4a53 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -57,6 +57,7 @@ #include <trace/events/oom.h> #include <linux/prefetch.h> #include <linux/mm_inline.h> +#include <linux/mmu_notifier.h> #include <linux/migrate.h> #include <linux/hugetlb.h> #include <linux/sched/rt.h> @@ -4124,7 +4125,7 @@ should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_fla static struct lockdep_map __fs_reclaim_map = STATIC_LOCKDEP_MAP_INIT("fs_reclaim", &__fs_reclaim_map);
-static bool __need_fs_reclaim(gfp_t gfp_mask) +static bool __need_reclaim(gfp_t gfp_mask) { gfp_mask = current_gfp_context(gfp_mask);
@@ -4136,10 +4137,6 @@ static bool __need_fs_reclaim(gfp_t gfp_mask) if (current->flags & PF_MEMALLOC) return false;
- /* We're only interested __GFP_FS allocations for now */
- if (!(gfp_mask & __GFP_FS))
return false;
- if (gfp_mask & __GFP_NOLOCKDEP) return false;
@@ -4158,15 +4155,23 @@ void __fs_reclaim_release(void)
void fs_reclaim_acquire(gfp_t gfp_mask) {
- if (__need_fs_reclaim(gfp_mask))
__fs_reclaim_acquire();
- if (__need_reclaim(gfp_mask)) {
if (!(gfp_mask & __GFP_FS))
Hmm. Shouldn't this be "if (gfp_mask & __GFP_FS)" or am I misunderstanding?
__fs_reclaim_acquire();
#ifdef CONFIG_MMU_NOTIFIER?
lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
lock_map_release(&__mmu_notifier_invalidate_range_start_map);
} } EXPORT_SYMBOL_GPL(fs_reclaim_acquire);
void fs_reclaim_release(gfp_t gfp_mask) {
- if (__need_fs_reclaim(gfp_mask))
__fs_reclaim_release();
- if (__need_reclaim(gfp_mask)) {
if (!(gfp_mask & __GFP_FS))
Same here?
__fs_reclaim_release();
- } } EXPORT_SYMBOL_GPL(fs_reclaim_release); #endif
One suggested test case would perhaps be to call madvise(MADV_DONTNEED) on a subpart of a transhuge page. That would IIRC trigger a page split and interesting mmu notifier calls....
Thanks, Thomas
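A rough userspace sketch of the test Thomas suggests (hypothetical and untested; it assumes transparent hugepages are enabled and that the touched range actually gets backed by a huge page):

#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#define SZ_2M (2UL << 20)

int main(void)
{
	/* Over-map so we can carve out a 2 MiB aligned region for a THP. */
	char *raw = mmap(NULL, 2 * SZ_2M, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char *p;

	if (raw == MAP_FAILED)
		return 1;
	p = (char *)(((uintptr_t)raw + SZ_2M - 1) & ~(SZ_2M - 1));

	madvise(p, SZ_2M, MADV_HUGEPAGE);
	memset(p, 0x42, SZ_2M);		/* fault in, ideally as one huge page */

	/*
	 * Zap a single 4 KiB subpage: this should split the huge page and
	 * exercise the mmu notifier invalidation paths mentioned above.
	 */
	madvise(p, 4096, MADV_DONTNEED);
	return 0;
}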
On Wed, Jun 10, 2020 at 2:01 PM Thomas Hellström (Intel) thomas_os@shipmail.org wrote:
Hi, Daniel,
Please see below.
On 6/4/20 10:12 AM, Daniel Vetter wrote:
fs_reclaim_acquire/release nicely catch recursion issues when allocating GFP_KERNEL memory against shrinkers (which gpu drivers tend to use to keep the excessive caches in check). For mmu notifier recursions we do have lockdep annotations since 23b68395c7c7 ("mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end").
But these only fire if a path actually results in some pte invalidation - for most small allocations that's very rarely the case. The other trouble is that pte invalidation can happen any time when __GFP_RECLAIM is set. Which means only really GFP_ATOMIC is a safe choice, GFP_NOIO isn't good enough to avoid potential mmu notifier recursion.
I was pondering whether we should just do the general annotation, but there's always the risk of false positives. Plus I'm assuming that the core fs and io code is a lot better reviewed and tested than random mmu notifier code in drivers. Hence why I decided to only annotate for that specific case.
Furthermore even if we'd create a lockdep map for direct reclaim, we'd still need to explicitly pull in the mmu notifier map - there are a lot more places that do pte invalidation than just direct reclaim, these two contexts aren't the same.
Note that the mmu notifiers needing their own independent lockdep map is also the reason we can't hold them from fs_reclaim_acquire to fs_reclaim_release - it would nest with the acquisition in the pte invalidation code, causing a lockdep splat. And we can't remove the annotations from pte invalidation and all the other places, since they're called from many more places than just page reclaim. Hence we can only do the equivalent of might_lock, but on the raw lockdep map.
With this we can also remove the lockdep priming added in 66204f1d2d1b ("mm/mmu_notifiers: prime lockdep") since the new annotations are strictly more powerful.
Cc: Andrew Morton akpm@linux-foundation.org Cc: Jason Gunthorpe jgg@mellanox.com Cc: linux-mm@kvack.org Cc: linux-rdma@vger.kernel.org Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com
This is part of a gpu lockdep annotation series simply because it really helps to catch issues where gpu subsystem locks and primitives can deadlock with themselves through allocations and mmu notifiers. But aside from that motivation it should be completely free-standing, and can land through -mm/-rdma/-hmm or any other tree really whenever.
-Daniel
mm/mmu_notifier.c | 7 ------- mm/page_alloc.c | 23 ++++++++++++++--------- 2 files changed, 14 insertions(+), 16 deletions(-)
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c index 06852b896fa6..5d578b9122f8 100644 --- a/mm/mmu_notifier.c +++ b/mm/mmu_notifier.c @@ -612,13 +612,6 @@ int __mmu_notifier_register(struct mmu_notifier *subscription, lockdep_assert_held_write(&mm->mmap_sem); BUG_ON(atomic_read(&mm->mm_users) <= 0);
if (IS_ENABLED(CONFIG_LOCKDEP)) {
fs_reclaim_acquire(GFP_KERNEL);
lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
lock_map_release(&__mmu_notifier_invalidate_range_start_map);
fs_reclaim_release(GFP_KERNEL);
}
if (!mm->notifier_subscriptions) { /* * kmalloc cannot be called under mm_take_all_locks(), but we
diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 13cc653122b7..f8a222db4a53 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -57,6 +57,7 @@ #include <trace/events/oom.h> #include <linux/prefetch.h> #include <linux/mm_inline.h> +#include <linux/mmu_notifier.h> #include <linux/migrate.h> #include <linux/hugetlb.h> #include <linux/sched/rt.h> @@ -4124,7 +4125,7 @@ should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_fla static struct lockdep_map __fs_reclaim_map = STATIC_LOCKDEP_MAP_INIT("fs_reclaim", &__fs_reclaim_map);
-static bool __need_fs_reclaim(gfp_t gfp_mask) +static bool __need_reclaim(gfp_t gfp_mask) { gfp_mask = current_gfp_context(gfp_mask);
@@ -4136,10 +4137,6 @@ static bool __need_fs_reclaim(gfp_t gfp_mask) if (current->flags & PF_MEMALLOC) return false;
/* We're only interested __GFP_FS allocations for now */
if (!(gfp_mask & __GFP_FS))
return false;
if (gfp_mask & __GFP_NOLOCKDEP) return false;
@@ -4158,15 +4155,23 @@ void __fs_reclaim_release(void)
void fs_reclaim_acquire(gfp_t gfp_mask) {
if (__need_fs_reclaim(gfp_mask))
__fs_reclaim_acquire();
if (__need_reclaim(gfp_mask)) {
if (!(gfp_mask & __GFP_FS))
Hmm. Shouldn't this be "if (gfp_mask & __GFP_FS)" or am I misunderstanding?
Uh yes :-( I guess what saved me is that I immediately went for the lockdep splat in drivers/gpu. And I guess there aren't any obvious inversions for GFP_NOFS/GFP_NOIO, and since I made the mistake consistently, the GFP_FS annotation was still consistent - just applied to GFP_NOFS instead. Oops.
Will fix in the next version.
__fs_reclaim_acquire();
#ifdef CONFIG_MMU_NOTIFIER?
Hm indeed. Will fix too.
Thanks for your review.
lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
lock_map_release(&__mmu_notifier_invalidate_range_start_map);
}
} EXPORT_SYMBOL_GPL(fs_reclaim_acquire);
void fs_reclaim_release(gfp_t gfp_mask) {
if (__need_fs_reclaim(gfp_mask))
__fs_reclaim_release();
if (__need_reclaim(gfp_mask)) {
if (!(gfp_mask & __GFP_FS))
Same here?
__fs_reclaim_release();
} EXPORT_SYMBOL_GPL(fs_reclaim_release); #endif}
One suggested test case would perhaps be to call madvise(MADV_DONTNEED) on a subpart of a transhuge page. That would IIRC trigger a page split and interesting mmu notifier calls....
The neat thing about the mmu notifier lockdep key is that we take it whether there are notifiers or not - it's taken outside of any of these paths. So as long as you have hit a hugepage split at some point since boot, and you've hit your driver's mmu_notifier paths, lockdep will connect the dots. Explicit testcases for all combinations aren't needed anymore. This patch here just makes sure that the same holds for memory allocations and direct reclaim (which is a lot harder to trigger intentionally in testcases).
That was at least the idea, seems to have caught a few things already. -Daniel
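For a concrete picture of how those dots get connected, a hypothetical driver-side example (my_notifier_lock and both functions are made-up names): the driver's invalidate_range_start callback runs with the mmu notifier lockdep map held by core mm, and with this patch every reclaim-capable allocation acquires/releases that same map inside fs_reclaim_acquire(), so the cycle is visible to lockdep without reclaim or a hugepage split happening at that exact moment:

/* Hypothetical driver code, for illustration only. */
#include <linux/mmu_notifier.h>
#include <linux/mutex.h>
#include <linux/slab.h>

static DEFINE_MUTEX(my_notifier_lock);

static int my_invalidate_range_start(struct mmu_notifier *mn,
				     const struct mmu_notifier_range *range)
{
	/*
	 * Runs under the mmu notifier lockdep map, so this records the
	 * dependency "notifier map -> my_notifier_lock".
	 */
	mutex_lock(&my_notifier_lock);
	/* ... shoot down device mappings ... */
	mutex_unlock(&my_notifier_lock);
	return 0;
}

static void my_ioctl_path(void)
{
	mutex_lock(&my_notifier_lock);
	/*
	 * fs_reclaim_acquire() now does a lock_map_acquire/release of the
	 * notifier map for any allocation that can reclaim, recording
	 * "my_notifier_lock -> notifier map" and closing the cycle above.
	 */
	kfree(kmalloc(64, GFP_KERNEL));
	mutex_unlock(&my_notifier_lock);
}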
Thanks, Thomas
fs_reclaim_acquire/release nicely catch recursion issues when allocating GFP_KERNEL memory against shrinkers (which gpu drivers tend to use to keep the excessive caches in check). For mmu notifier recursions we do have lockdep annotations since 23b68395c7c7 ("mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end").
But these only fire if a path actually results in some pte invalidation - for most small allocations that's very rarely the case. The other trouble is that pte invalidation can happen any time when __GFP_RECLAIM is set. Which means only really GFP_ATOMIC is a safe choice, GFP_NOIO isn't good enough to avoid potential mmu notifier recursion.
I was pondering whether we should just do the general annotation, but there's always the risk of false positives. Plus I'm assuming that the core fs and io code is a lot better reviewed and tested than random mmu notifier code in drivers. Hence why I decided to only annotate for that specific case.
Furthermore even if we'd create a lockdep map for direct reclaim, we'd still need to explicitly pull in the mmu notifier map - there are a lot more places that do pte invalidation than just direct reclaim, these two contexts aren't the same.
Note that the mmu notifiers needing their own independent lockdep map is also the reason we can't hold them from fs_reclaim_acquire to fs_reclaim_release - it would nest with the acquisition in the pte invalidation code, causing a lockdep splat. And we can't remove the annotations from pte invalidation and all the other places, since they're called from many more places than just page reclaim. Hence we can only do the equivalent of might_lock, but on the raw lockdep map.
With this we can also remove the lockdep priming added in 66204f1d2d1b ("mm/mmu_notifiers: prime lockdep") since the new annotations are strictly more powerful.
v2: Review from Thomas Hellstrom:
- unbotch the fs_reclaim context check, I accidentally inverted it, but it didn't blow up because I inverted it immediately
- fix compiling for !CONFIG_MMU_NOTIFIER
Cc: Thomas Hellström (Intel) <thomas_os@shipmail.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: linux-mm@kvack.org
Cc: linux-rdma@vger.kernel.org
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
This is part of a gpu lockdep annotation series simply because it really helps to catch issues where gpu subsystem locks and primitives can deadlock with themselves through allocations and mmu notifiers. But aside from that motivation it should be completely free-standing, and can land through -mm/-rdma/-hmm or any other tree really whenever.
-Daniel
---
 mm/mmu_notifier.c |  7 -------
 mm/page_alloc.c   | 25 ++++++++++++++++---------
 2 files changed, 16 insertions(+), 16 deletions(-)
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 06852b896fa6..5d578b9122f8 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -612,13 +612,6 @@ int __mmu_notifier_register(struct mmu_notifier *subscription,
 	lockdep_assert_held_write(&mm->mmap_sem);
 	BUG_ON(atomic_read(&mm->mm_users) <= 0);
 
-	if (IS_ENABLED(CONFIG_LOCKDEP)) {
-		fs_reclaim_acquire(GFP_KERNEL);
-		lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
-		lock_map_release(&__mmu_notifier_invalidate_range_start_map);
-		fs_reclaim_release(GFP_KERNEL);
-	}
-
 	if (!mm->notifier_subscriptions) {
 		/*
 		 * kmalloc cannot be called under mm_take_all_locks(), but we
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 13cc653122b7..7536faaaa0fd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -57,6 +57,7 @@
 #include <trace/events/oom.h>
 #include <linux/prefetch.h>
 #include <linux/mm_inline.h>
+#include <linux/mmu_notifier.h>
 #include <linux/migrate.h>
 #include <linux/hugetlb.h>
 #include <linux/sched/rt.h>
@@ -4124,7 +4125,7 @@ should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_fla
 static struct lockdep_map __fs_reclaim_map =
 	STATIC_LOCKDEP_MAP_INIT("fs_reclaim", &__fs_reclaim_map);
 
-static bool __need_fs_reclaim(gfp_t gfp_mask)
+static bool __need_reclaim(gfp_t gfp_mask)
 {
 	gfp_mask = current_gfp_context(gfp_mask);
 
@@ -4136,10 +4137,6 @@ static bool __need_fs_reclaim(gfp_t gfp_mask)
 	if (current->flags & PF_MEMALLOC)
 		return false;
 
-	/* We're only interested __GFP_FS allocations for now */
-	if (!(gfp_mask & __GFP_FS))
-		return false;
-
 	if (gfp_mask & __GFP_NOLOCKDEP)
 		return false;
 
@@ -4158,15 +4155,25 @@ void __fs_reclaim_release(void)
 
 void fs_reclaim_acquire(gfp_t gfp_mask)
 {
-	if (__need_fs_reclaim(gfp_mask))
-		__fs_reclaim_acquire();
+	if (__need_reclaim(gfp_mask)) {
+		if (gfp_mask & __GFP_FS)
+			__fs_reclaim_acquire();
+
+#ifdef CONFIG_MMU_NOTIFIER
+		lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
+		lock_map_release(&__mmu_notifier_invalidate_range_start_map);
+#endif
+
+	}
 }
 EXPORT_SYMBOL_GPL(fs_reclaim_acquire);
 
 void fs_reclaim_release(gfp_t gfp_mask)
 {
-	if (__need_fs_reclaim(gfp_mask))
-		__fs_reclaim_release();
+	if (__need_reclaim(gfp_mask)) {
+		if (gfp_mask & __GFP_FS)
+			__fs_reclaim_release();
+	}
 }
 EXPORT_SYMBOL_GPL(fs_reclaim_release);
 #endif
On Wed, Jun 10, 2020 at 09:41:01PM +0200, Daniel Vetter wrote:
fs_reclaim_acquire/release nicely catch recursion issues when allocating GFP_KERNEL memory against shrinkers (which gpu drivers tend to use to keep the excessive caches in check). For mmu notifier recursions we do have lockdep annotations since 23b68395c7c7 ("mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end").
But these only fire if a path actually results in some pte invalidation - for most small allocations that's very rarely the case. The other trouble is that pte invalidation can happen any time when __GFP_RECLAIM is set. Which means only really GFP_ATOMIC is a safe choice, GFP_NOIO isn't good enough to avoid potential mmu notifier recursion.
I was pondering whether we should just do the general annotation, but there's always the risk of false positives. Plus I'm assuming that the core fs and io code is a lot better reviewed and tested than random mmu notifier code in drivers. Hence why I decided to only annotate for that specific case.
Furthermore even if we'd create a lockdep map for direct reclaim, we'd still need to explicitly pull in the mmu notifier map - there are a lot more places that do pte invalidation than just direct reclaim, these two contexts aren't the same.
Note that the mmu notifiers needing their own independent lockdep map is also the reason we can't hold them from fs_reclaim_acquire to fs_reclaim_release - it would nest with the acquisition in the pte invalidation code, causing a lockdep splat. And we can't remove the annotations from pte invalidation and all the other places, since they're called from many more places than just page reclaim. Hence we can only do the equivalent of might_lock, but on the raw lockdep map.
With this we can also remove the lockdep priming added in 66204f1d2d1b ("mm/mmu_notifiers: prime lockdep") since the new annotations are strictly more powerful.
v2: Review from Thomas Hellstrom:
- unbotch the fs_reclaim context check, I accidentally inverted it, but it didn't blow up because I inverted it immediately
- fix compiling for !CONFIG_MMU_NOTIFIER
Cc: Thomas Hellström (Intel) thomas_os@shipmail.org Cc: Andrew Morton akpm@linux-foundation.org Cc: Jason Gunthorpe jgg@mellanox.com Cc: linux-mm@kvack.org Cc: linux-rdma@vger.kernel.org Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com
This is part of a gpu lockdep annotation series simply because it really helps to catch issues where gpu subsystem locks and primitives can deadlock with themselves through allocations and mmu notifiers. But aside from that motivation it should be completely free-standing, and can land through -mm/-rdma/-hmm or any other tree really whenever. -Daniel
I'm still not totally clear on how all the GFP flags map to different behaviors, but this seems plausible to me
At this point it should go through Andrew's tree, thanks
Acked-by: Jason Gunthorpe jgg@mellanox.com # For mmu_notifiers
Jason
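To spell out the gfp-flag mapping Jason mentions, here is my reading of what the v2 fs_reclaim_acquire() does per allocation context - a paraphrase of the patch, not an authoritative statement:

/* Paraphrase of v2 fs_reclaim_acquire(), per gfp_mask (after current_gfp_context()): */
if (!(gfp_mask & __GFP_DIRECT_RECLAIM) ||	/* e.g. GFP_ATOMIC */
    (current->flags & PF_MEMALLOC) ||
    (gfp_mask & __GFP_NOLOCKDEP))
	return;					/* no annotation at all */

if (gfp_mask & __GFP_FS)			/* e.g. GFP_KERNEL, GFP_USER */
	__fs_reclaim_acquire();			/* fs_reclaim dependency, as before */

/*
 * GFP_NOFS and GFP_NOIO still reach this point: any allocation that can
 * enter direct reclaim can end up invalidating ptes, so the mmu notifier
 * lockdep map is always primed (when CONFIG_MMU_NOTIFIER is enabled).
 */
lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
lock_map_release(&__mmu_notifier_invalidate_range_start_map);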
On Wed, Jun 10, 2020 at 09:41:01PM +0200, Daniel Vetter wrote:
fs_reclaim_acquire/release nicely catch recursion issues when allocating GFP_KERNEL memory against shrinkers (which gpu drivers tend to use to keep the excessive caches in check). For mmu notifier recursions we do have lockdep annotations since 23b68395c7c7 ("mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end").
But these only fire if a path actually results in some pte invalidation - for most small allocations that's very rarely the case. The other trouble is that pte invalidation can happen any time when __GFP_RECLAIM is set. Which means only really GFP_ATOMIC is a safe choice, GFP_NOIO isn't good enough to avoid potential mmu notifier recursion.
I was pondering whether we should just do the general annotation, but there's always the risk of false positives. Plus I'm assuming that the core fs and io code is a lot better reviewed and tested than random mmu notifier code in drivers. Hence why I decided to only annotate for that specific case.
Furthermore even if we'd create a lockdep map for direct reclaim, we'd still need to explicitly pull in the mmu notifier map - there are a lot more places that do pte invalidation than just direct reclaim, these two contexts aren't the same.
Note that the mmu notifiers needing their own independent lockdep map is also the reason we can't hold them from fs_reclaim_acquire to fs_reclaim_release - it would nest with the acquisition in the pte invalidation code, causing a lockdep splat. And we can't remove the annotations from pte invalidation and all the other places, since they're called from many more places than just page reclaim. Hence we can only do the equivalent of might_lock, but on the raw lockdep map.
With this we can also remove the lockdep priming added in 66204f1d2d1b ("mm/mmu_notifiers: prime lockdep") since the new annotations are strictly more powerful.
v2: Review from Thomas Hellstrom:
- unbotch the fs_reclaim context check, I accidentally inverted it, but it didn't blow up because I inverted it immediately
- fix compiling for !CONFIG_MMU_NOTIFIER
Cc: Thomas Hellström (Intel) thomas_os@shipmail.org Cc: Andrew Morton akpm@linux-foundation.org Cc: Jason Gunthorpe jgg@mellanox.com Cc: linux-mm@kvack.org Cc: linux-rdma@vger.kernel.org Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com
Replying to the right patch here...
Reverting this commit [1] fixed the lockdep warning below while applying some memory pressure.
[1] linux-next cbf7c9d86d75 ("mm: track mmu notifiers in fs_reclaim_acquire/release")
[ 190.455003][ T369] WARNING: possible circular locking dependency detected [ 190.487291][ T369] 5.8.0-rc1-next-20200621 #1 Not tainted [ 190.512363][ T369] ------------------------------------------------------ [ 190.543354][ T369] kswapd3/369 is trying to acquire lock: [ 190.568523][ T369] ffff889fcf694528 (&xfs_nondir_ilock_class){++++}-{3:3}, at: xfs_reclaim_inode+0xdf/0x860 spin_lock at include/linux/spinlock.h:353 (inlined by) xfs_iflags_test_and_set at fs/xfs/xfs_inode.h:166 (inlined by) xfs_iflock_nowait at fs/xfs/xfs_inode.h:249 (inlined by) xfs_reclaim_inode at fs/xfs/xfs_icache.c:1127 [ 190.614359][ T369] [ 190.614359][ T369] but task is already holding lock: [ 190.647763][ T369] ffffffffb50ced00 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x0/0x30 __fs_reclaim_acquire at mm/page_alloc.c:4200 [ 190.687845][ T369] [ 190.687845][ T369] which lock already depends on the new lock. [ 190.687845][ T369] [ 190.734890][ T369] [ 190.734890][ T369] the existing dependency chain (in reverse order) is: [ 190.775991][ T369] [ 190.775991][ T369] -> #1 (fs_reclaim){+.+.}-{0:0}: [ 190.808150][ T369] fs_reclaim_acquire+0x77/0x80 [ 190.832152][ T369] slab_pre_alloc_hook.constprop.52+0x20/0x120 slab_pre_alloc_hook at mm/slab.h:507 [ 190.862173][ T369] kmem_cache_alloc+0x43/0x2a0 [ 190.885602][ T369] kmem_zone_alloc+0x113/0x3ef kmem_zone_alloc at fs/xfs/kmem.c:129 [ 190.908702][ T369] xfs_inode_item_init+0x1d/0xa0 xfs_inode_item_init at fs/xfs/xfs_inode_item.c:639 [ 190.934461][ T369] xfs_trans_ijoin+0x96/0x100 xfs_trans_ijoin at fs/xfs/libxfs/xfs_trans_inode.c:34 [ 190.961530][ T369] xfs_setattr_nonsize+0x1a6/0xcd0 xfs_setattr_nonsize at fs/xfs/xfs_iops.c:716 [ 190.987331][ T369] xfs_vn_setattr+0x133/0x160 xfs_vn_setattr at fs/xfs/xfs_iops.c:1081 [ 191.010476][ T369] notify_change+0x6c5/0xba1 notify_change at fs/attr.c:336 [ 191.033317][ T369] chmod_common+0x19b/0x390 [ 191.055770][ T369] ksys_fchmod+0x28/0x60 [ 191.077957][ T369] __x64_sys_fchmod+0x4e/0x70 [ 191.102767][ T369] do_syscall_64+0x5f/0x310 [ 191.125090][ T369] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 191.153749][ T369] [ 191.153749][ T369] -> #0 (&xfs_nondir_ilock_class){++++}-{3:3}: [ 191.191267][ T369] __lock_acquire+0x2efc/0x4da0 [ 191.215974][ T369] lock_acquire+0x1ac/0xaf0 [ 191.238953][ T369] down_write_nested+0x92/0x150 [ 191.262955][ T369] xfs_reclaim_inode+0xdf/0x860 [ 191.287149][ T369] xfs_reclaim_inodes_ag+0x505/0xb00 [ 191.313291][ T369] xfs_reclaim_inodes_nr+0x93/0xd0 [ 191.338357][ T369] super_cache_scan+0x2fd/0x430 [ 191.362354][ T369] do_shrink_slab+0x317/0x990 [ 191.385341][ T369] shrink_slab+0x3a8/0x4b0 [ 191.407214][ T369] shrink_node+0x49c/0x17b0 [ 191.429841][ T369] balance_pgdat+0x59c/0xed0 [ 191.455041][ T369] kswapd+0x5a4/0xc40 [ 191.477524][ T369] kthread+0x358/0x420 [ 191.499285][ T369] ret_from_fork+0x22/0x30 [ 191.521107][ T369] [ 191.521107][ T369] other info that might help us debug this: [ 191.521107][ T369] [ 191.567490][ T369] Possible unsafe locking scenario: [ 191.567490][ T369] [ 191.600947][ T369] CPU0 CPU1 [ 191.624808][ T369] ---- ---- [ 191.649236][ T369] lock(fs_reclaim); [ 191.667607][ T369] lock(&xfs_nondir_ilock_class); [ 191.702096][ T369] lock(fs_reclaim); [ 191.731243][ T369] lock(&xfs_nondir_ilock_class); [ 191.754025][ T369] [ 191.754025][ T369] *** DEADLOCK *** [ 191.754025][ T369] [ 191.791126][ T369] 4 locks held by kswapd3/369: [ 191.812198][ T369] #0: ffffffffb50ced00 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x0/0x30 [ 191.854319][ T369] #1: ffffffffb5074c50 
(shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x219/0x4b0 [ 191.896043][ T369] #2: ffff8890279b40e0 (&type->s_umount_key#27){++++}-{3:3}, at: trylock_super+0x11/0xb0 [ 191.940538][ T369] #3: ffff889027a73a28 (&pag->pag_ici_reclaim_lock){+.+.}-{3:3}, at: xfs_reclaim_inodes_ag+0x135/0xb00 [ 191.995314][ T369] [ 191.995314][ T369] stack backtrace: [ 192.022934][ T369] CPU: 42 PID: 369 Comm: kswapd3 Not tainted 5.8.0-rc1-next-20200621 #1 [ 192.060546][ T369] Hardware name: HP ProLiant BL660c Gen9, BIOS I38 10/17/2018 [ 192.094518][ T369] Call Trace: [ 192.109005][ T369] dump_stack+0x9d/0xe0 [ 192.127468][ T369] check_noncircular+0x347/0x400 [ 192.149526][ T369] ? print_circular_bug+0x360/0x360 [ 192.172584][ T369] ? freezing_slow_path.cold.2+0x2a/0x2a [ 192.197251][ T369] __lock_acquire+0x2efc/0x4da0 [ 192.218737][ T369] ? lockdep_hardirqs_on_prepare+0x550/0x550 [ 192.246736][ T369] ? __lock_acquire+0x3541/0x4da0 [ 192.269673][ T369] lock_acquire+0x1ac/0xaf0 [ 192.290192][ T369] ? xfs_reclaim_inode+0xdf/0x860 [ 192.313158][ T369] ? rcu_read_unlock+0x50/0x50 [ 192.335057][ T369] down_write_nested+0x92/0x150 [ 192.358409][ T369] ? xfs_reclaim_inode+0xdf/0x860 [ 192.380890][ T369] ? rwsem_down_write_slowpath+0xf50/0xf50 [ 192.406891][ T369] ? find_held_lock+0x33/0x1c0 [ 192.427925][ T369] ? xfs_ilock+0x2ef/0x370 [ 192.447496][ T369] ? xfs_reclaim_inode+0xdf/0x860 [ 192.472315][ T369] xfs_reclaim_inode+0xdf/0x860 [ 192.496649][ T369] ? xfs_inode_clear_reclaim_tag+0xa0/0xa0 [ 192.524188][ T369] ? do_raw_spin_unlock+0x4f/0x250 [ 192.546852][ T369] xfs_reclaim_inodes_ag+0x505/0xb00 [ 192.570473][ T369] ? xfs_reclaim_inode+0x860/0x860 [ 192.592692][ T369] ? mark_held_locks+0xb0/0x110 [ 192.614287][ T369] ? lockdep_hardirqs_on_prepare+0x38c/0x550 [ 192.640800][ T369] ? _raw_spin_unlock_irqrestore+0x39/0x40 [ 192.666695][ T369] ? try_to_wake_up+0xcf/0xf40 [ 192.688265][ T369] ? migrate_swap_stop+0xc10/0xc10 [ 192.711966][ T369] ? do_raw_spin_unlock+0x4f/0x250 [ 192.735032][ T369] xfs_reclaim_inodes_nr+0x93/0xd0 xfs_reclaim_inodes_nr at fs/xfs/xfs_icache.c:1399 [ 192.757674][ T369] ? xfs_reclaim_inodes+0x90/0x90 [ 192.780028][ T369] ? list_lru_count_one+0x177/0x300 [ 192.803010][ T369] super_cache_scan+0x2fd/0x430 super_cache_scan at fs/super.c:115 [ 192.824491][ T369] do_shrink_slab+0x317/0x990 do_shrink_slab at mm/vmscan.c:514 [ 192.845160][ T369] shrink_slab+0x3a8/0x4b0 shrink_slab_memcg at mm/vmscan.c:584 (inlined by) shrink_slab at mm/vmscan.c:662 [ 192.864722][ T369] ? do_shrink_slab+0x990/0x990 [ 192.886137][ T369] ? rcu_is_watching+0x2c/0x80 [ 192.907289][ T369] ? mem_cgroup_protected+0x228/0x470 [ 192.931166][ T369] ? vmpressure+0x25/0x290 [ 192.950595][ T369] shrink_node+0x49c/0x17b0 [ 192.972332][ T369] balance_pgdat+0x59c/0xed0 kswapd_shrink_node at mm/vmscan.c:3521 (inlined by) balance_pgdat at mm/vmscan.c:3670 [ 192.994918][ T369] ? __node_reclaim+0x950/0x950 [ 193.018625][ T369] ? lockdep_hardirqs_on_prepare+0x38c/0x550 [ 193.046566][ T369] ? _raw_spin_unlock_irq+0x1f/0x30 [ 193.070214][ T369] ? _raw_spin_unlock_irq+0x1f/0x30 [ 193.093176][ T369] ? finish_task_switch+0x129/0x650 [ 193.116225][ T369] ? finish_task_switch+0xf2/0x650 [ 193.138809][ T369] ? rcu_read_lock_bh_held+0xc0/0xc0 [ 193.163323][ T369] kswapd+0x5a4/0xc40 [ 193.182690][ T369] ? __kthread_parkme+0x4d/0x1a0 [ 193.204660][ T369] ? balance_pgdat+0xed0/0xed0 [ 193.225776][ T369] ? _raw_spin_unlock_irqrestore+0x39/0x40 [ 193.252306][ T369] ? finish_wait+0x270/0x270 [ 193.272473][ T369] ? 
__kthread_parkme+0x4d/0x1a0 [ 193.294476][ T369] ? __kthread_parkme+0xcc/0x1a0 [ 193.316704][ T369] ? balance_pgdat+0xed0/0xed0 [ 193.337808][ T369] kthread+0x358/0x420 [ 193.355666][ T369] ? kthread_create_worker_on_cpu+0xc0/0xc0 [ 193.381884][ T369] ret_from_fork+0x22/0x30
This is part of a gpu lockdep annotation series simply because it really helps to catch issues where gpu subsystem locks and primitives can deadlock with themselves through allocations and mmu notifiers. But aside from that motivation it should be completely free-standing, and can land through -mm/-rdma/-hmm or any other tree really whenever.
-Daniel
mm/mmu_notifier.c | 7 ------- mm/page_alloc.c | 25 ++++++++++++++++--------- 2 files changed, 16 insertions(+), 16 deletions(-)
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c index 06852b896fa6..5d578b9122f8 100644 --- a/mm/mmu_notifier.c +++ b/mm/mmu_notifier.c @@ -612,13 +612,6 @@ int __mmu_notifier_register(struct mmu_notifier *subscription, lockdep_assert_held_write(&mm->mmap_sem); BUG_ON(atomic_read(&mm->mm_users) <= 0);
- if (IS_ENABLED(CONFIG_LOCKDEP)) {
fs_reclaim_acquire(GFP_KERNEL);
lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
lock_map_release(&__mmu_notifier_invalidate_range_start_map);
fs_reclaim_release(GFP_KERNEL);
- }
- if (!mm->notifier_subscriptions) { /*
- kmalloc cannot be called under mm_take_all_locks(), but we
diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 13cc653122b7..7536faaaa0fd 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -57,6 +57,7 @@ #include <trace/events/oom.h> #include <linux/prefetch.h> #include <linux/mm_inline.h> +#include <linux/mmu_notifier.h> #include <linux/migrate.h> #include <linux/hugetlb.h> #include <linux/sched/rt.h> @@ -4124,7 +4125,7 @@ should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_fla static struct lockdep_map __fs_reclaim_map = STATIC_LOCKDEP_MAP_INIT("fs_reclaim", &__fs_reclaim_map);
-static bool __need_fs_reclaim(gfp_t gfp_mask) +static bool __need_reclaim(gfp_t gfp_mask) { gfp_mask = current_gfp_context(gfp_mask);
@@ -4136,10 +4137,6 @@ static bool __need_fs_reclaim(gfp_t gfp_mask) if (current->flags & PF_MEMALLOC) return false;
- /* We're only interested __GFP_FS allocations for now */
- if (!(gfp_mask & __GFP_FS))
return false;
- if (gfp_mask & __GFP_NOLOCKDEP) return false;
@@ -4158,15 +4155,25 @@ void __fs_reclaim_release(void)
void fs_reclaim_acquire(gfp_t gfp_mask) {
- if (__need_fs_reclaim(gfp_mask))
__fs_reclaim_acquire();
- if (__need_reclaim(gfp_mask)) {
if (gfp_mask & __GFP_FS)
__fs_reclaim_acquire();
+#ifdef CONFIG_MMU_NOTIFIER
lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
lock_map_release(&__mmu_notifier_invalidate_range_start_map);
+#endif
- }
} EXPORT_SYMBOL_GPL(fs_reclaim_acquire);
void fs_reclaim_release(gfp_t gfp_mask) {
- if (__need_fs_reclaim(gfp_mask))
__fs_reclaim_release();
- if (__need_reclaim(gfp_mask)) {
if (gfp_mask & __GFP_FS)
__fs_reclaim_release();
- }
} EXPORT_SYMBOL_GPL(fs_reclaim_release);
#endif
2.26.2
On Sun, Jun 21, 2020 at 7:42 PM Qian Cai cai@lca.pw wrote:
On Wed, Jun 10, 2020 at 09:41:01PM +0200, Daniel Vetter wrote:
fs_reclaim_acquire/release nicely catch recursion issues when allocating GFP_KERNEL memory against shrinkers (which gpu drivers tend to use to keep the excessive caches in check). For mmu notifier recursions we do have lockdep annotations since 23b68395c7c7 ("mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end").
But these only fire if a path actually results in some pte invalidation - for most small allocations that's very rarely the case. The other trouble is that pte invalidation can happen any time when __GFP_RECLAIM is set. Which means only really GFP_ATOMIC is a safe choice, GFP_NOIO isn't good enough to avoid potential mmu notifier recursion.
I was pondering whether we should just do the general annotation, but there's always the risk of false positives. Plus I'm assuming that the core fs and io code is a lot better reviewed and tested than random mmu notifier code in drivers. Hence why I decided to only annotate for that specific case.
Furthermore even if we'd create a lockdep map for direct reclaim, we'd still need to explicitly pull in the mmu notifier map - there are a lot more places that do pte invalidation than just direct reclaim, these two contexts aren't the same.
Note that the mmu notifiers needing their own independent lockdep map is also the reason we can't hold them from fs_reclaim_acquire to fs_reclaim_release - it would nest with the acquisition in the pte invalidation code, causing a lockdep splat. And we can't remove the annotations from pte invalidation and all the other places, since they're called from many more places than just page reclaim. Hence we can only do the equivalent of might_lock, but on the raw lockdep map.
With this we can also remove the lockdep priming added in 66204f1d2d1b ("mm/mmu_notifiers: prime lockdep") since the new annotations are strictly more powerful.
v2: Review from Thomas Hellstrom:
- unbotch the fs_reclaim context check, I accidentally inverted it, but it didn't blow up because I inverted it immediately
- fix compiling for !CONFIG_MMU_NOTIFIER
Cc: Thomas Hellström (Intel) thomas_os@shipmail.org Cc: Andrew Morton akpm@linux-foundation.org Cc: Jason Gunthorpe jgg@mellanox.com Cc: linux-mm@kvack.org Cc: linux-rdma@vger.kernel.org Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com
Replying to the right patch here...
Reverting this commit [1] fixed the lockdep warning below while applying some memory pressure.
[1] linux-next cbf7c9d86d75 ("mm: track mmu notifiers in fs_reclaim_acquire/release")
Hm, then I'm confused because
- there's no mmu notifier lockdep map in the splat at all
- the patch is supposed to not change anything for fs_reclaim (but the interim version got that wrong)
- looking at the paths it's kmalloc vs kswapd, both places where I totally expect fs_reclaim to be used.
But you're claiming reverting this prevents the lockdep splat. If that's right, then my reasoning above is broken somewhere. Someone less blind than me having an idea?
Aside: this is the first email I'd typed, until I realized the first report was against the broken patch, which looked like a much more reasonable explanation (but didn't quite match up with the code paths).
Thanks, Daniel
[ 190.455003][ T369] WARNING: possible circular locking dependency detected [ 190.487291][ T369] 5.8.0-rc1-next-20200621 #1 Not tainted [ 190.512363][ T369] ------------------------------------------------------ [ 190.543354][ T369] kswapd3/369 is trying to acquire lock: [ 190.568523][ T369] ffff889fcf694528 (&xfs_nondir_ilock_class){++++}-{3:3}, at: xfs_reclaim_inode+0xdf/0x860 spin_lock at include/linux/spinlock.h:353 (inlined by) xfs_iflags_test_and_set at fs/xfs/xfs_inode.h:166 (inlined by) xfs_iflock_nowait at fs/xfs/xfs_inode.h:249 (inlined by) xfs_reclaim_inode at fs/xfs/xfs_icache.c:1127 [ 190.614359][ T369] [ 190.614359][ T369] but task is already holding lock: [ 190.647763][ T369] ffffffffb50ced00 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x0/0x30 __fs_reclaim_acquire at mm/page_alloc.c:4200 [ 190.687845][ T369] [ 190.687845][ T369] which lock already depends on the new lock. [ 190.687845][ T369] [ 190.734890][ T369] [ 190.734890][ T369] the existing dependency chain (in reverse order) is: [ 190.775991][ T369] [ 190.775991][ T369] -> #1 (fs_reclaim){+.+.}-{0:0}: [ 190.808150][ T369] fs_reclaim_acquire+0x77/0x80 [ 190.832152][ T369] slab_pre_alloc_hook.constprop.52+0x20/0x120 slab_pre_alloc_hook at mm/slab.h:507 [ 190.862173][ T369] kmem_cache_alloc+0x43/0x2a0 [ 190.885602][ T369] kmem_zone_alloc+0x113/0x3ef kmem_zone_alloc at fs/xfs/kmem.c:129 [ 190.908702][ T369] xfs_inode_item_init+0x1d/0xa0 xfs_inode_item_init at fs/xfs/xfs_inode_item.c:639 [ 190.934461][ T369] xfs_trans_ijoin+0x96/0x100 xfs_trans_ijoin at fs/xfs/libxfs/xfs_trans_inode.c:34 [ 190.961530][ T369] xfs_setattr_nonsize+0x1a6/0xcd0 xfs_setattr_nonsize at fs/xfs/xfs_iops.c:716 [ 190.987331][ T369] xfs_vn_setattr+0x133/0x160 xfs_vn_setattr at fs/xfs/xfs_iops.c:1081 [ 191.010476][ T369] notify_change+0x6c5/0xba1 notify_change at fs/attr.c:336 [ 191.033317][ T369] chmod_common+0x19b/0x390 [ 191.055770][ T369] ksys_fchmod+0x28/0x60 [ 191.077957][ T369] __x64_sys_fchmod+0x4e/0x70 [ 191.102767][ T369] do_syscall_64+0x5f/0x310 [ 191.125090][ T369] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 191.153749][ T369] [ 191.153749][ T369] -> #0 (&xfs_nondir_ilock_class){++++}-{3:3}: [ 191.191267][ T369] __lock_acquire+0x2efc/0x4da0 [ 191.215974][ T369] lock_acquire+0x1ac/0xaf0 [ 191.238953][ T369] down_write_nested+0x92/0x150 [ 191.262955][ T369] xfs_reclaim_inode+0xdf/0x860 [ 191.287149][ T369] xfs_reclaim_inodes_ag+0x505/0xb00 [ 191.313291][ T369] xfs_reclaim_inodes_nr+0x93/0xd0 [ 191.338357][ T369] super_cache_scan+0x2fd/0x430 [ 191.362354][ T369] do_shrink_slab+0x317/0x990 [ 191.385341][ T369] shrink_slab+0x3a8/0x4b0 [ 191.407214][ T369] shrink_node+0x49c/0x17b0 [ 191.429841][ T369] balance_pgdat+0x59c/0xed0 [ 191.455041][ T369] kswapd+0x5a4/0xc40 [ 191.477524][ T369] kthread+0x358/0x420 [ 191.499285][ T369] ret_from_fork+0x22/0x30 [ 191.521107][ T369] [ 191.521107][ T369] other info that might help us debug this: [ 191.521107][ T369] [ 191.567490][ T369] Possible unsafe locking scenario: [ 191.567490][ T369] [ 191.600947][ T369] CPU0 CPU1 [ 191.624808][ T369] ---- ---- [ 191.649236][ T369] lock(fs_reclaim); [ 191.667607][ T369] lock(&xfs_nondir_ilock_class); [ 191.702096][ T369] lock(fs_reclaim); [ 191.731243][ T369] lock(&xfs_nondir_ilock_class); [ 191.754025][ T369] [ 191.754025][ T369] *** DEADLOCK *** [ 191.754025][ T369] [ 191.791126][ T369] 4 locks held by kswapd3/369: [ 191.812198][ T369] #0: ffffffffb50ced00 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x0/0x30 [ 191.854319][ T369] #1: ffffffffb5074c50 
(shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x219/0x4b0 [ 191.896043][ T369] #2: ffff8890279b40e0 (&type->s_umount_key#27){++++}-{3:3}, at: trylock_super+0x11/0xb0 [ 191.940538][ T369] #3: ffff889027a73a28 (&pag->pag_ici_reclaim_lock){+.+.}-{3:3}, at: xfs_reclaim_inodes_ag+0x135/0xb00 [ 191.995314][ T369] [ 191.995314][ T369] stack backtrace: [ 192.022934][ T369] CPU: 42 PID: 369 Comm: kswapd3 Not tainted 5.8.0-rc1-next-20200621 #1 [ 192.060546][ T369] Hardware name: HP ProLiant BL660c Gen9, BIOS I38 10/17/2018 [ 192.094518][ T369] Call Trace: [ 192.109005][ T369] dump_stack+0x9d/0xe0 [ 192.127468][ T369] check_noncircular+0x347/0x400 [ 192.149526][ T369] ? print_circular_bug+0x360/0x360 [ 192.172584][ T369] ? freezing_slow_path.cold.2+0x2a/0x2a [ 192.197251][ T369] __lock_acquire+0x2efc/0x4da0 [ 192.218737][ T369] ? lockdep_hardirqs_on_prepare+0x550/0x550 [ 192.246736][ T369] ? __lock_acquire+0x3541/0x4da0 [ 192.269673][ T369] lock_acquire+0x1ac/0xaf0 [ 192.290192][ T369] ? xfs_reclaim_inode+0xdf/0x860 [ 192.313158][ T369] ? rcu_read_unlock+0x50/0x50 [ 192.335057][ T369] down_write_nested+0x92/0x150 [ 192.358409][ T369] ? xfs_reclaim_inode+0xdf/0x860 [ 192.380890][ T369] ? rwsem_down_write_slowpath+0xf50/0xf50 [ 192.406891][ T369] ? find_held_lock+0x33/0x1c0 [ 192.427925][ T369] ? xfs_ilock+0x2ef/0x370 [ 192.447496][ T369] ? xfs_reclaim_inode+0xdf/0x860 [ 192.472315][ T369] xfs_reclaim_inode+0xdf/0x860 [ 192.496649][ T369] ? xfs_inode_clear_reclaim_tag+0xa0/0xa0 [ 192.524188][ T369] ? do_raw_spin_unlock+0x4f/0x250 [ 192.546852][ T369] xfs_reclaim_inodes_ag+0x505/0xb00 [ 192.570473][ T369] ? xfs_reclaim_inode+0x860/0x860 [ 192.592692][ T369] ? mark_held_locks+0xb0/0x110 [ 192.614287][ T369] ? lockdep_hardirqs_on_prepare+0x38c/0x550 [ 192.640800][ T369] ? _raw_spin_unlock_irqrestore+0x39/0x40 [ 192.666695][ T369] ? try_to_wake_up+0xcf/0xf40 [ 192.688265][ T369] ? migrate_swap_stop+0xc10/0xc10 [ 192.711966][ T369] ? do_raw_spin_unlock+0x4f/0x250 [ 192.735032][ T369] xfs_reclaim_inodes_nr+0x93/0xd0 xfs_reclaim_inodes_nr at fs/xfs/xfs_icache.c:1399 [ 192.757674][ T369] ? xfs_reclaim_inodes+0x90/0x90 [ 192.780028][ T369] ? list_lru_count_one+0x177/0x300 [ 192.803010][ T369] super_cache_scan+0x2fd/0x430 super_cache_scan at fs/super.c:115 [ 192.824491][ T369] do_shrink_slab+0x317/0x990 do_shrink_slab at mm/vmscan.c:514 [ 192.845160][ T369] shrink_slab+0x3a8/0x4b0 shrink_slab_memcg at mm/vmscan.c:584 (inlined by) shrink_slab at mm/vmscan.c:662 [ 192.864722][ T369] ? do_shrink_slab+0x990/0x990 [ 192.886137][ T369] ? rcu_is_watching+0x2c/0x80 [ 192.907289][ T369] ? mem_cgroup_protected+0x228/0x470 [ 192.931166][ T369] ? vmpressure+0x25/0x290 [ 192.950595][ T369] shrink_node+0x49c/0x17b0 [ 192.972332][ T369] balance_pgdat+0x59c/0xed0 kswapd_shrink_node at mm/vmscan.c:3521 (inlined by) balance_pgdat at mm/vmscan.c:3670 [ 192.994918][ T369] ? __node_reclaim+0x950/0x950 [ 193.018625][ T369] ? lockdep_hardirqs_on_prepare+0x38c/0x550 [ 193.046566][ T369] ? _raw_spin_unlock_irq+0x1f/0x30 [ 193.070214][ T369] ? _raw_spin_unlock_irq+0x1f/0x30 [ 193.093176][ T369] ? finish_task_switch+0x129/0x650 [ 193.116225][ T369] ? finish_task_switch+0xf2/0x650 [ 193.138809][ T369] ? rcu_read_lock_bh_held+0xc0/0xc0 [ 193.163323][ T369] kswapd+0x5a4/0xc40 [ 193.182690][ T369] ? __kthread_parkme+0x4d/0x1a0 [ 193.204660][ T369] ? balance_pgdat+0xed0/0xed0 [ 193.225776][ T369] ? _raw_spin_unlock_irqrestore+0x39/0x40 [ 193.252306][ T369] ? finish_wait+0x270/0x270 [ 193.272473][ T369] ? 
__kthread_parkme+0x4d/0x1a0 [ 193.294476][ T369] ? __kthread_parkme+0xcc/0x1a0 [ 193.316704][ T369] ? balance_pgdat+0xed0/0xed0 [ 193.337808][ T369] kthread+0x358/0x420 [ 193.355666][ T369] ? kthread_create_worker_on_cpu+0xc0/0xc0 [ 193.381884][ T369] ret_from_fork+0x22/0x30
This is part of a gpu lockdep annotation series simply because it really helps to catch issues where gpu subsystem locks and primitives can deadlock with themselves through allocations and mmu notifiers. But aside from that motivation it should be completely free-standing, and can land through -mm/-rdma/-hmm or any other tree really whenever.
-Daniel
mm/mmu_notifier.c | 7 ------- mm/page_alloc.c | 25 ++++++++++++++++--------- 2 files changed, 16 insertions(+), 16 deletions(-)
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c index 06852b896fa6..5d578b9122f8 100644 --- a/mm/mmu_notifier.c +++ b/mm/mmu_notifier.c @@ -612,13 +612,6 @@ int __mmu_notifier_register(struct mmu_notifier *subscription, lockdep_assert_held_write(&mm->mmap_sem); BUG_ON(atomic_read(&mm->mm_users) <= 0);
if (IS_ENABLED(CONFIG_LOCKDEP)) {
fs_reclaim_acquire(GFP_KERNEL);
lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
lock_map_release(&__mmu_notifier_invalidate_range_start_map);
fs_reclaim_release(GFP_KERNEL);
}
if (!mm->notifier_subscriptions) { /* * kmalloc cannot be called under mm_take_all_locks(), but we
diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 13cc653122b7..7536faaaa0fd 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -57,6 +57,7 @@ #include <trace/events/oom.h> #include <linux/prefetch.h> #include <linux/mm_inline.h> +#include <linux/mmu_notifier.h> #include <linux/migrate.h> #include <linux/hugetlb.h> #include <linux/sched/rt.h> @@ -4124,7 +4125,7 @@ should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_fla static struct lockdep_map __fs_reclaim_map = STATIC_LOCKDEP_MAP_INIT("fs_reclaim", &__fs_reclaim_map);
-static bool __need_fs_reclaim(gfp_t gfp_mask) +static bool __need_reclaim(gfp_t gfp_mask) { gfp_mask = current_gfp_context(gfp_mask);
@@ -4136,10 +4137,6 @@ static bool __need_fs_reclaim(gfp_t gfp_mask) if (current->flags & PF_MEMALLOC) return false;
/* We're only interested __GFP_FS allocations for now */
if (!(gfp_mask & __GFP_FS))
return false;
if (gfp_mask & __GFP_NOLOCKDEP) return false;
@@ -4158,15 +4155,25 @@ void __fs_reclaim_release(void)
void fs_reclaim_acquire(gfp_t gfp_mask) {
if (__need_fs_reclaim(gfp_mask))
__fs_reclaim_acquire();
if (__need_reclaim(gfp_mask)) {
if (gfp_mask & __GFP_FS)
__fs_reclaim_acquire();
+#ifdef CONFIG_MMU_NOTIFIER
lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
lock_map_release(&__mmu_notifier_invalidate_range_start_map);
+#endif
}
} EXPORT_SYMBOL_GPL(fs_reclaim_acquire);
void fs_reclaim_release(gfp_t gfp_mask) {
if (__need_fs_reclaim(gfp_mask))
__fs_reclaim_release();
if (__need_reclaim(gfp_mask)) {
if (gfp_mask & __GFP_FS)
__fs_reclaim_release();
}
} EXPORT_SYMBOL_GPL(fs_reclaim_release);
#endif
2.26.2
On Sun, Jun 21, 2020 at 08:07:08PM +0200, Daniel Vetter wrote:
On Sun, Jun 21, 2020 at 7:42 PM Qian Cai cai@lca.pw wrote:
On Wed, Jun 10, 2020 at 09:41:01PM +0200, Daniel Vetter wrote:
fs_reclaim_acquire/release nicely catch recursion issues when allocating GFP_KERNEL memory against shrinkers (which gpu drivers tend to use to keep the excessive caches in check). For mmu notifier recursions we do have lockdep annotations since 23b68395c7c7 ("mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end").
But these only fire if a path actually results in some pte invalidation - for most small allocations that's very rarely the case. The other trouble is that pte invalidation can happen any time when __GFP_RECLAIM is set. Which means only really GFP_ATOMIC is a safe choice, GFP_NOIO isn't good enough to avoid potential mmu notifier recursion.
I was pondering whether we should just do the general annotation, but there's always the risk of false positives. Plus I'm assuming that the core fs and io code is a lot better reviewed and tested than random mmu notifier code in drivers. Hence why I decided to only annotate for that specific case.
Furthermore even if we'd create a lockdep map for direct reclaim, we'd still need to explicitly pull in the mmu notifier map - there are a lot more places that do pte invalidation than just direct reclaim, these two contexts aren't the same.
Note that the mmu notifiers needing their own independent lockdep map is also the reason we can't hold them from fs_reclaim_acquire to fs_reclaim_release - it would nest with the acquisition in the pte invalidation code, causing a lockdep splat. And we can't remove the annotations from pte invalidation and all the other places, since they're called from many more places than just page reclaim. Hence we can only do the equivalent of might_lock, but on the raw lockdep map.
With this we can also remove the lockdep priming added in 66204f1d2d1b ("mm/mmu_notifiers: prime lockdep") since the new annotations are strictly more powerful.
v2: Review from Thomas Hellstrom:
- unbotch the fs_reclaim context check, I accidentally inverted it, but it didn't blow up because I inverted it immediately
- fix compiling for !CONFIG_MMU_NOTIFIER
Cc: Thomas Hellström (Intel) thomas_os@shipmail.org Cc: Andrew Morton akpm@linux-foundation.org Cc: Jason Gunthorpe jgg@mellanox.com Cc: linux-mm@kvack.org Cc: linux-rdma@vger.kernel.org Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com
Replying to the right patch here...
Reverting this commit [1] fixed the lockdep warning below while applying some memory pressure.
[1] linux-next cbf7c9d86d75 ("mm: track mmu notifiers in fs_reclaim_acquire/release")
Hm, then I'm confused because
- there's no mmu notifier lockdep map in the splat at all
- the patch is supposed to not change anything for fs_reclaim (but the
interim version got that wrong)
- looking at the paths it's kmalloc vs kswapd, both places I totally
expect fs_reclaim to be used.
But you're claiming reverting this prevents the lockdep splat. If that's right, then my reasoning above is broken somewhere. Someone less blind than me having an idea?
Aside: this is the first email I had typed, before I realized the first report was against the broken patch, and that looked like a much more reasonable explanation (but didn't quite match up with the code paths).
Below diff should undo the functional change in my patch. Can you pls test whether the lockdep splat is really gone with that? Might need a lot of testing and memory pressure to be sure, since all these reclaim paths aren't very deterministic. -Daniel
---
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d807587c9ae6..27ea763c6155 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4191,11 +4191,6 @@ void fs_reclaim_acquire(gfp_t gfp_mask)
 		if (gfp_mask & __GFP_FS)
 			__fs_reclaim_acquire();
 
-#ifdef CONFIG_MMU_NOTIFIER
-		lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
-		lock_map_release(&__mmu_notifier_invalidate_range_start_map);
-#endif
-
 	}
 }
 EXPORT_SYMBOL_GPL(fs_reclaim_acquire);
On Sun, Jun 21, 2020 at 10:01:03PM +0200, Daniel Vetter wrote:
On Sun, Jun 21, 2020 at 08:07:08PM +0200, Daniel Vetter wrote:
On Sun, Jun 21, 2020 at 7:42 PM Qian Cai cai@lca.pw wrote:
On Wed, Jun 10, 2020 at 09:41:01PM +0200, Daniel Vetter wrote:
fs_reclaim_acquire/release nicely catch recursion issues when allocating GFP_KERNEL memory against shrinkers (which gpu drivers tend to use to keep the excessive caches in check). For mmu notifier recursions we do have lockdep annotations since 23b68395c7c7 ("mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end").
But these only fire if a path actually results in some pte invalidation - for most small allocations that's very rarely the case. The other trouble is that pte invalidation can happen any time when __GFP_RECLAIM is set. Which means only really GFP_ATOMIC is a safe choice, GFP_NOIO isn't good enough to avoid potential mmu notifier recursion.
I was pondering whether we should just do the general annotation, but there's always the risk for false positives. Plus I'm assuming that the core fs and io code is a lot better reviewed and tested than random mmu notifier code in drivers. Hence why I decide to only annotate for that specific case.
Furthermore even if we'd create a lockdep map for direct reclaim, we'd still need to explicit pull in the mmu notifier map - there's a lot more places that do pte invalidation than just direct reclaim, these two contexts arent the same.
Note that the mmu notifiers needing their own independent lockdep map is also the reason we can't hold them from fs_reclaim_acquire to fs_reclaim_release - it would nest with the acquistion in the pte invalidation code, causing a lockdep splat. And we can't remove the annotations from pte invalidation and all the other places since they're called from many other places than page reclaim. Hence we can only do the equivalent of might_lock, but on the raw lockdep map.
With this we can also remove the lockdep priming added in 66204f1d2d1b ("mm/mmu_notifiers: prime lockdep") since the new annotations are strictly more powerful.
v2: Review from Thomas Hellstrom:
- unbotch the fs_reclaim context check, I accidentally inverted it, but it didn't blow up because I inverted it immediately
- fix compiling for !CONFIG_MMU_NOTIFIER
Cc: Thomas Hellström (Intel) thomas_os@shipmail.org Cc: Andrew Morton akpm@linux-foundation.org Cc: Jason Gunthorpe jgg@mellanox.com Cc: linux-mm@kvack.org Cc: linux-rdma@vger.kernel.org Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com
Replying the right patch here...
Reverting this commit [1] fixed the lockdep warning below while applying some memory pressure.
[1] linux-next cbf7c9d86d75 ("mm: track mmu notifiers in fs_reclaim_acquire/release")
Hm, then I'm confused because
- there's not mmut notifier lockdep map in the splat at a..
- the patch is supposed to not change anything for fs_reclaim (but the
interim version got that wrong)
- looking at the paths it's kmalloc vs kswapd, both places I totally
expect fs_reflaim to be used.
But you're claiming reverting this prevents the lockdep splat. If that's right, then my reasoning above is broken somewhere. Someone less blind than me having an idea?
Aside this is the first email I've typed, until I realized the first report was against the broken patch and that looked like a much more reasonable explanation (but didn't quite match up with the code paths).
Below diff should undo the functional change in my patch. Can you pls test whether the lockdep splat is really gone with that? Might need a lot of testing and memory pressure to be sure, since all these reclaim paths aren't very deterministic.
Well, I run heavy memory pressure workloads on linux-next pretty much every day, and never saw this splat until today, when your patch first showed up.
Since I am rather busy tracking another regression, here are the steps to reproduce (super easy to reproduce on multiple machines here):
# git clone https://github.com/cailca/linux-mm.git
# cd linux-mm; make
# ./random 0
The .config is in there as well if it ever matters.
On Sun, Jun 21, 2020 at 10:01:03PM +0200, Daniel Vetter wrote:
On Sun, Jun 21, 2020 at 08:07:08PM +0200, Daniel Vetter wrote:
On Sun, Jun 21, 2020 at 7:42 PM Qian Cai cai@lca.pw wrote:
On Wed, Jun 10, 2020 at 09:41:01PM +0200, Daniel Vetter wrote:
fs_reclaim_acquire/release nicely catch recursion issues when allocating GFP_KERNEL memory against shrinkers (which gpu drivers tend to use to keep the excessive caches in check). For mmu notifier recursions we do have lockdep annotations since 23b68395c7c7 ("mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end").
But these only fire if a path actually results in some pte invalidation - for most small allocations that's very rarely the case. The other trouble is that pte invalidation can happen any time when __GFP_RECLAIM is set. Which means only really GFP_ATOMIC is a safe choice, GFP_NOIO isn't good enough to avoid potential mmu notifier recursion.
I was pondering whether we should just do the general annotation, but there's always the risk for false positives. Plus I'm assuming that the core fs and io code is a lot better reviewed and tested than random mmu notifier code in drivers. Hence why I decide to only annotate for that specific case.
Furthermore even if we'd create a lockdep map for direct reclaim, we'd still need to explicit pull in the mmu notifier map - there's a lot more places that do pte invalidation than just direct reclaim, these two contexts arent the same.
Note that the mmu notifiers needing their own independent lockdep map is also the reason we can't hold them from fs_reclaim_acquire to fs_reclaim_release - it would nest with the acquistion in the pte invalidation code, causing a lockdep splat. And we can't remove the annotations from pte invalidation and all the other places since they're called from many other places than page reclaim. Hence we can only do the equivalent of might_lock, but on the raw lockdep map.
With this we can also remove the lockdep priming added in 66204f1d2d1b ("mm/mmu_notifiers: prime lockdep") since the new annotations are strictly more powerful.
v2: Review from Thomas Hellstrom:
- unbotch the fs_reclaim context check, I accidentally inverted it, but it didn't blow up because I inverted it immediately
- fix compiling for !CONFIG_MMU_NOTIFIER
Cc: Thomas Hellström (Intel) thomas_os@shipmail.org Cc: Andrew Morton akpm@linux-foundation.org Cc: Jason Gunthorpe jgg@mellanox.com Cc: linux-mm@kvack.org Cc: linux-rdma@vger.kernel.org Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com
Replying the right patch here...
Reverting this commit [1] fixed the lockdep warning below while applying some memory pressure.
[1] linux-next cbf7c9d86d75 ("mm: track mmu notifiers in fs_reclaim_acquire/release")
Hm, then I'm confused because
- there's not mmut notifier lockdep map in the splat at a..
- the patch is supposed to not change anything for fs_reclaim (but the
interim version got that wrong)
- looking at the paths it's kmalloc vs kswapd, both places I totally
expect fs_reflaim to be used.
But you're claiming reverting this prevents the lockdep splat. If that's right, then my reasoning above is broken somewhere. Someone less blind than me having an idea?
Aside this is the first email I've typed, until I realized the first report was against the broken patch and that looked like a much more reasonable explanation (but didn't quite match up with the code paths).
Below diff should undo the functional change in my patch. Can you pls test whether the lockdep splat is really gone with that? Might need a lot of testing and memory pressure to be sure, since all these reclaim paths aren't very deterministic.
No, this patch does not help, but reverting the whole patch still fixed the splat.
-Daniel
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d807587c9ae6..27ea763c6155 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4191,11 +4191,6 @@ void fs_reclaim_acquire(gfp_t gfp_mask)
 		if (gfp_mask & __GFP_FS)
 			__fs_reclaim_acquire();
 
-#ifdef CONFIG_MMU_NOTIFIER
-		lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
-		lock_map_release(&__mmu_notifier_invalidate_range_start_map);
-#endif
-
 	}
 }
 EXPORT_SYMBOL_GPL(fs_reclaim_acquire);
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
On Tue, Jun 23, 2020 at 6:18 PM Qian Cai cai@lca.pw wrote:
On Sun, Jun 21, 2020 at 10:01:03PM +0200, Daniel Vetter wrote:
On Sun, Jun 21, 2020 at 08:07:08PM +0200, Daniel Vetter wrote:
On Sun, Jun 21, 2020 at 7:42 PM Qian Cai cai@lca.pw wrote:
On Wed, Jun 10, 2020 at 09:41:01PM +0200, Daniel Vetter wrote:
fs_reclaim_acquire/release nicely catch recursion issues when allocating GFP_KERNEL memory against shrinkers (which gpu drivers tend to use to keep the excessive caches in check). For mmu notifier recursions we do have lockdep annotations since 23b68395c7c7 ("mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end").
But these only fire if a path actually results in some pte invalidation - for most small allocations that's very rarely the case. The other trouble is that pte invalidation can happen any time when __GFP_RECLAIM is set. Which means only really GFP_ATOMIC is a safe choice, GFP_NOIO isn't good enough to avoid potential mmu notifier recursion.
I was pondering whether we should just do the general annotation, but there's always the risk for false positives. Plus I'm assuming that the core fs and io code is a lot better reviewed and tested than random mmu notifier code in drivers. Hence why I decide to only annotate for that specific case.
Furthermore even if we'd create a lockdep map for direct reclaim, we'd still need to explicit pull in the mmu notifier map - there's a lot more places that do pte invalidation than just direct reclaim, these two contexts arent the same.
Note that the mmu notifiers needing their own independent lockdep map is also the reason we can't hold them from fs_reclaim_acquire to fs_reclaim_release - it would nest with the acquistion in the pte invalidation code, causing a lockdep splat. And we can't remove the annotations from pte invalidation and all the other places since they're called from many other places than page reclaim. Hence we can only do the equivalent of might_lock, but on the raw lockdep map.
With this we can also remove the lockdep priming added in 66204f1d2d1b ("mm/mmu_notifiers: prime lockdep") since the new annotations are strictly more powerful.
v2: Review from Thomas Hellstrom:
- unbotch the fs_reclaim context check, I accidentally inverted it, but it didn't blow up because I inverted it immediately
- fix compiling for !CONFIG_MMU_NOTIFIER
Cc: Thomas Hellström (Intel) thomas_os@shipmail.org Cc: Andrew Morton akpm@linux-foundation.org Cc: Jason Gunthorpe jgg@mellanox.com Cc: linux-mm@kvack.org Cc: linux-rdma@vger.kernel.org Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com
Replying the right patch here...
Reverting this commit [1] fixed the lockdep warning below while applying some memory pressure.
[1] linux-next cbf7c9d86d75 ("mm: track mmu notifiers in fs_reclaim_acquire/release")
Hm, then I'm confused because
- there's not mmut notifier lockdep map in the splat at a..
- the patch is supposed to not change anything for fs_reclaim (but the
interim version got that wrong)
- looking at the paths it's kmalloc vs kswapd, both places I totally
expect fs_reflaim to be used.
But you're claiming reverting this prevents the lockdep splat. If that's right, then my reasoning above is broken somewhere. Someone less blind than me having an idea?
Aside this is the first email I've typed, until I realized the first report was against the broken patch and that looked like a much more reasonable explanation (but didn't quite match up with the code paths).
Below diff should undo the functional change in my patch. Can you pls test whether the lockdep splat is really gone with that? Might need a lot of testing and memory pressure to be sure, since all these reclaim paths aren't very deterministic.
No, this patch does not help but reverting the whole patch still fixed the splat.
Ok I tested this. I can't use your script to repro because

- I don't have a setup with xfs, and the splat points at an issue in xfs
- reproducing lockdep splats in shrinker callbacks is always a bit tricky
So instead I made a quick test to validate whether the fs_reclaim annotations work correctly, and nothing has changed:
+ printk("GFP_NOFS block\n"); + fs_reclaim_acquire(GFP_NOFS); + printk("allocate atomic\n"); + kfree(kmalloc(16, GFP_ATOMIC)); + printk("allocate noio\n"); + kfree(kmalloc(16, GFP_NOIO));
The below two calls to kmalloc are wrong, but the current annotations don't track __GFP_IO and other levels, only __GFP_FS. So no lockdep splats here.
+ printk("allocate nofs\n"); + kfree(kmalloc(16, GFP_NOFS)); + printk("allocate kernel\n"); + kfree(kmalloc(16, GFP_KERNEL)); + fs_reclaim_release(GFP_NOFS); + + + printk("GFP_KERNEL block\n"); + fs_reclaim_acquire(GFP_KERNEL); + printk("allocate atomic\n"); + kfree(kmalloc(16, GFP_ATOMIC)); + printk("allocate noio\n"); + kfree(kmalloc(16, GFP_NOIO)); + printk("allocate nofs\n"); + kfree(kmalloc(16, GFP_NOFS));
This allocation is buggy and should splat. It does so both with my patch applied and with my patch reverted.
+ printk("allocate kernel\n"); + kfree(kmalloc(16, GFP_KERNEL)); + fs_reclaim_release(GFP_KERNEL);
I also looked at the paths in your lockdep splat in xfs, this is simply GFP_KERNEL vs a shrinker reclaim in kswapd.
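Schematically, the inversion lockdep is complaining about looks like the following; this is only a rough sketch, with a made-up example_ilock, example_setattr and example_scan_objects standing in for the xfs code, not the actual paths:

#include <linux/rwsem.h>
#include <linux/shrinker.h>
#include <linux/slab.h>

static DECLARE_RWSEM(example_ilock);

/* fchmod()-style path: inode lock held, then a GFP_KERNEL allocation that
 * may enter direct reclaim, so lockdep records example_ilock -> fs_reclaim. */
static void example_setattr(void)
{
	down_write(&example_ilock);
	kfree(kmalloc(64, GFP_KERNEL));
	up_write(&example_ilock);
}

/* kswapd path: already under fs_reclaim, the super shrinker reclaims
 * inodes and takes the same lock class, recording fs_reclaim ->
 * example_ilock. Together that is the reported cycle. */
static unsigned long example_scan_objects(struct shrinker *shrink,
					  struct shrink_control *sc)
{
	down_write(&example_ilock);
	/* ... reclaim the inode ... */
	up_write(&example_ilock);
	return 1;
}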
Summary: Everything is working as expected, there's no change in the lockdep annotations.
I really think the problem is that either your testcase doesn't hit the issue reliably enough, or that you're not actually testing the same kernels and there are some other changes (xfs most likely, but really it could be anywhere) which are causing this regression. I'm rather convinced now after this test that it's not my stuff.
Thanks, Daniel
-Daniel
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d807587c9ae6..27ea763c6155 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4191,11 +4191,6 @@ void fs_reclaim_acquire(gfp_t gfp_mask)
 		if (gfp_mask & __GFP_FS)
 			__fs_reclaim_acquire();
 
-#ifdef CONFIG_MMU_NOTIFIER
-		lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
-		lock_map_release(&__mmu_notifier_invalidate_range_start_map);
-#endif
-
 	}
 }
 EXPORT_SYMBOL_GPL(fs_reclaim_acquire);
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
On Jun 23, 2020, at 6:13 PM, Daniel Vetter daniel@ffwll.ch wrote:
Ok I tested this. I can't use your script to repro because
- I don't have a setup with xfs, and the splat points at an issue in xfs
- reproducing lockdep splats in shrinker callbacks is always a bit tricky
What xfs setup are you talking about? This is a simple xfs rootfs and then triggering swapping. Nothing tricky here, as it hits on multiple machines within a few seconds on linux-next.
Summary: Everything is working as expected, there's no change in the lockdep annotations. I really think the problem is that either your testcase doesn't hit the issue reliably enough, or that you're not actually testing the same kernels and there's some other changes (xfs most likely, but really it could be anywhere) which is causing this regression. I'm rather convinced now after this test that it's not my stuff.
Well, the memory pressure workloads have been running for years on daily linux-next builds and I never saw this one happen once. Also, the revert is ONLY of your patch on top of linux-next, and that alone stops the splat, so there is no question of not testing the same kernel here.
On Sun, Jun 21, 2020 at 01:42:05PM -0400, Qian Cai wrote:
On Wed, Jun 10, 2020 at 09:41:01PM +0200, Daniel Vetter wrote:
fs_reclaim_acquire/release nicely catch recursion issues when allocating GFP_KERNEL memory against shrinkers (which gpu drivers tend to use to keep the excessive caches in check). For mmu notifier recursions we do have lockdep annotations since 23b68395c7c7 ("mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end").
But these only fire if a path actually results in some pte invalidation - for most small allocations that's very rarely the case. The other trouble is that pte invalidation can happen any time when __GFP_RECLAIM is set. Which means only really GFP_ATOMIC is a safe choice, GFP_NOIO isn't good enough to avoid potential mmu notifier recursion.
I was pondering whether we should just do the general annotation, but there's always the risk for false positives. Plus I'm assuming that the core fs and io code is a lot better reviewed and tested than random mmu notifier code in drivers. Hence why I decide to only annotate for that specific case.
Furthermore even if we'd create a lockdep map for direct reclaim, we'd still need to explicit pull in the mmu notifier map - there's a lot more places that do pte invalidation than just direct reclaim, these two contexts arent the same.
Note that the mmu notifiers needing their own independent lockdep map is also the reason we can't hold them from fs_reclaim_acquire to fs_reclaim_release - it would nest with the acquistion in the pte invalidation code, causing a lockdep splat. And we can't remove the annotations from pte invalidation and all the other places since they're called from many other places than page reclaim. Hence we can only do the equivalent of might_lock, but on the raw lockdep map.
With this we can also remove the lockdep priming added in 66204f1d2d1b ("mm/mmu_notifiers: prime lockdep") since the new annotations are strictly more powerful.
v2: Review from Thomas Hellstrom:
- unbotch the fs_reclaim context check, I accidentally inverted it, but it didn't blow up because I inverted it immediately
- fix compiling for !CONFIG_MMU_NOTIFIER
Cc: Thomas Hellström (Intel) thomas_os@shipmail.org Cc: Andrew Morton akpm@linux-foundation.org Cc: Jason Gunthorpe jgg@mellanox.com Cc: linux-mm@kvack.org Cc: linux-rdma@vger.kernel.org Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com
Replying the right patch here...
Reverting this commit [1] fixed the lockdep warning below while applying some memory pressure.
[1] linux-next cbf7c9d86d75 ("mm: track mmu notifiers in fs_reclaim_acquire/release")
[ 190.455003][ T369] WARNING: possible circular locking dependency detected [ 190.487291][ T369] 5.8.0-rc1-next-20200621 #1 Not tainted [ 190.512363][ T369] ------------------------------------------------------ [ 190.543354][ T369] kswapd3/369 is trying to acquire lock: [ 190.568523][ T369] ffff889fcf694528 (&xfs_nondir_ilock_class){++++}-{3:3}, at: xfs_reclaim_inode+0xdf/0x860 spin_lock at include/linux/spinlock.h:353 (inlined by) xfs_iflags_test_and_set at fs/xfs/xfs_inode.h:166 (inlined by) xfs_iflock_nowait at fs/xfs/xfs_inode.h:249 (inlined by) xfs_reclaim_inode at fs/xfs/xfs_icache.c:1127 [ 190.614359][ T369] [ 190.614359][ T369] but task is already holding lock: [ 190.647763][ T369] ffffffffb50ced00 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x0/0x30 __fs_reclaim_acquire at mm/page_alloc.c:4200 [ 190.687845][ T369] [ 190.687845][ T369] which lock already depends on the new lock. [ 190.687845][ T369] [ 190.734890][ T369] [ 190.734890][ T369] the existing dependency chain (in reverse order) is: [ 190.775991][ T369] [ 190.775991][ T369] -> #1 (fs_reclaim){+.+.}-{0:0}: [ 190.808150][ T369] fs_reclaim_acquire+0x77/0x80 [ 190.832152][ T369] slab_pre_alloc_hook.constprop.52+0x20/0x120 slab_pre_alloc_hook at mm/slab.h:507 [ 190.862173][ T369] kmem_cache_alloc+0x43/0x2a0 [ 190.885602][ T369] kmem_zone_alloc+0x113/0x3ef kmem_zone_alloc at fs/xfs/kmem.c:129 [ 190.908702][ T369] xfs_inode_item_init+0x1d/0xa0 xfs_inode_item_init at fs/xfs/xfs_inode_item.c:639 [ 190.934461][ T369] xfs_trans_ijoin+0x96/0x100 xfs_trans_ijoin at fs/xfs/libxfs/xfs_trans_inode.c:34 [ 190.961530][ T369] xfs_setattr_nonsize+0x1a6/0xcd0
OK, this patch has royally screwed something up if this path thinks it can enter memory reclaim. This path is inside a transaction, so it is running under PF_MEMALLOC_NOFS context, so should *never* enter memory reclaim.
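Roughly how that scoped NOFS context works; a sketch only, using memalloc_nofs_save()/memalloc_nofs_restore() and current_gfp_context(), with a made-up example_transaction() rather than the actual xfs transaction code:

#include <linux/sched/mm.h>
#include <linux/slab.h>

/* Sketch: xfs transactions do the equivalent of this for all allocations
 * made while the transaction is active. */
static void example_transaction(void)
{
	unsigned int nofs_flags;

	nofs_flags = memalloc_nofs_save();	/* sets PF_MEMALLOC_NOFS */

	/*
	 * A nominally GFP_KERNEL allocation in here is effectively GFP_NOFS:
	 * current_gfp_context(GFP_KERNEL) strips __GFP_FS, so it must never
	 * recurse into filesystem reclaim, and no reclaim annotation should
	 * pretend that it can.
	 */
	kfree(kmalloc(64, GFP_KERNEL));

	memalloc_nofs_restore(nofs_flags);
}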
I'd suggest that whatever mods were made to fs_reclaim_acquire by this patch broke its basic functionality....
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 13cc653122b7..7536faaaa0fd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -57,6 +57,7 @@
 #include <trace/events/oom.h>
 #include <linux/prefetch.h>
 #include <linux/mm_inline.h>
+#include <linux/mmu_notifier.h>
 #include <linux/migrate.h>
 #include <linux/hugetlb.h>
 #include <linux/sched/rt.h>
@@ -4124,7 +4125,7 @@ should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_fla
 static struct lockdep_map __fs_reclaim_map =
 	STATIC_LOCKDEP_MAP_INIT("fs_reclaim", &__fs_reclaim_map);
 
-static bool __need_fs_reclaim(gfp_t gfp_mask)
+static bool __need_reclaim(gfp_t gfp_mask)
 {
 	gfp_mask = current_gfp_context(gfp_mask);
This applies the per-task memory allocation context flags to the mask that is checked here.
@@ -4136,10 +4137,6 @@ static bool __need_fs_reclaim(gfp_t gfp_mask)
 	if (current->flags & PF_MEMALLOC)
 		return false;
 
-	/* We're only interested __GFP_FS allocations for now */
-	if (!(gfp_mask & __GFP_FS))
-		return false;
-
 	if (gfp_mask & __GFP_NOLOCKDEP)
 		return false;
@@ -4158,15 +4155,25 @@ void __fs_reclaim_release(void)
 
 void fs_reclaim_acquire(gfp_t gfp_mask)
 {
-	if (__need_fs_reclaim(gfp_mask))
-		__fs_reclaim_acquire();
+	if (__need_reclaim(gfp_mask)) {
+		if (gfp_mask & __GFP_FS)
+			__fs_reclaim_acquire();
.... and they have not been applied in this path. There's your breakage.
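i.e. the context flags need to be applied before the __GFP_FS check as well; a rough, untested sketch of the shape the fix would need (not the actual respin):

void fs_reclaim_acquire(gfp_t gfp_mask)
{
	/* apply PF_MEMALLOC_NOFS/NOIO before any of the checks */
	gfp_mask = current_gfp_context(gfp_mask);

	if (__need_reclaim(gfp_mask)) {
		if (gfp_mask & __GFP_FS)
			__fs_reclaim_acquire();

#ifdef CONFIG_MMU_NOTIFIER
		lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
		lock_map_release(&__mmu_notifier_invalidate_range_start_map);
#endif
	}
}

with __need_reclaim() then dropping its own current_gfp_context() call so the flags are only applied once, and fs_reclaim_release() getting the same treatment.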
For future reference, please post anything that changes NOFS allocation contexts or behaviours to linux-fsdevel, as filesystem developers need to know about proposed changes to infrastructure that is critical to the correct functioning of filesystems...
Cheers,
Dave.
On Wed, Jun 24, 2020 at 12:31 AM Dave Chinner david@fromorbit.com wrote:
On Sun, Jun 21, 2020 at 01:42:05PM -0400, Qian Cai wrote:
On Wed, Jun 10, 2020 at 09:41:01PM +0200, Daniel Vetter wrote:
fs_reclaim_acquire/release nicely catch recursion issues when allocating GFP_KERNEL memory against shrinkers (which gpu drivers tend to use to keep the excessive caches in check). For mmu notifier recursions we do have lockdep annotations since 23b68395c7c7 ("mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end").
But these only fire if a path actually results in some pte invalidation - for most small allocations that's very rarely the case. The other trouble is that pte invalidation can happen any time when __GFP_RECLAIM is set. Which means only really GFP_ATOMIC is a safe choice, GFP_NOIO isn't good enough to avoid potential mmu notifier recursion.
I was pondering whether we should just do the general annotation, but there's always the risk for false positives. Plus I'm assuming that the core fs and io code is a lot better reviewed and tested than random mmu notifier code in drivers. Hence why I decide to only annotate for that specific case.
Furthermore even if we'd create a lockdep map for direct reclaim, we'd still need to explicit pull in the mmu notifier map - there's a lot more places that do pte invalidation than just direct reclaim, these two contexts arent the same.
Note that the mmu notifiers needing their own independent lockdep map is also the reason we can't hold them from fs_reclaim_acquire to fs_reclaim_release - it would nest with the acquistion in the pte invalidation code, causing a lockdep splat. And we can't remove the annotations from pte invalidation and all the other places since they're called from many other places than page reclaim. Hence we can only do the equivalent of might_lock, but on the raw lockdep map.
With this we can also remove the lockdep priming added in 66204f1d2d1b ("mm/mmu_notifiers: prime lockdep") since the new annotations are strictly more powerful.
v2: Review from Thomas Hellstrom:
- unbotch the fs_reclaim context check, I accidentally inverted it, but it didn't blow up because I inverted it immediately
- fix compiling for !CONFIG_MMU_NOTIFIER
Cc: Thomas Hellström (Intel) thomas_os@shipmail.org Cc: Andrew Morton akpm@linux-foundation.org Cc: Jason Gunthorpe jgg@mellanox.com Cc: linux-mm@kvack.org Cc: linux-rdma@vger.kernel.org Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com
Replying the right patch here...
Reverting this commit [1] fixed the lockdep warning below while applying some memory pressure.
[1] linux-next cbf7c9d86d75 ("mm: track mmu notifiers in fs_reclaim_acquire/release")
[ 190.455003][ T369] WARNING: possible circular locking dependency detected [ 190.487291][ T369] 5.8.0-rc1-next-20200621 #1 Not tainted [ 190.512363][ T369] ------------------------------------------------------ [ 190.543354][ T369] kswapd3/369 is trying to acquire lock: [ 190.568523][ T369] ffff889fcf694528 (&xfs_nondir_ilock_class){++++}-{3:3}, at: xfs_reclaim_inode+0xdf/0x860 spin_lock at include/linux/spinlock.h:353 (inlined by) xfs_iflags_test_and_set at fs/xfs/xfs_inode.h:166 (inlined by) xfs_iflock_nowait at fs/xfs/xfs_inode.h:249 (inlined by) xfs_reclaim_inode at fs/xfs/xfs_icache.c:1127 [ 190.614359][ T369] [ 190.614359][ T369] but task is already holding lock: [ 190.647763][ T369] ffffffffb50ced00 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x0/0x30 __fs_reclaim_acquire at mm/page_alloc.c:4200 [ 190.687845][ T369] [ 190.687845][ T369] which lock already depends on the new lock. [ 190.687845][ T369] [ 190.734890][ T369] [ 190.734890][ T369] the existing dependency chain (in reverse order) is: [ 190.775991][ T369] [ 190.775991][ T369] -> #1 (fs_reclaim){+.+.}-{0:0}: [ 190.808150][ T369] fs_reclaim_acquire+0x77/0x80 [ 190.832152][ T369] slab_pre_alloc_hook.constprop.52+0x20/0x120 slab_pre_alloc_hook at mm/slab.h:507 [ 190.862173][ T369] kmem_cache_alloc+0x43/0x2a0 [ 190.885602][ T369] kmem_zone_alloc+0x113/0x3ef kmem_zone_alloc at fs/xfs/kmem.c:129 [ 190.908702][ T369] xfs_inode_item_init+0x1d/0xa0 xfs_inode_item_init at fs/xfs/xfs_inode_item.c:639 [ 190.934461][ T369] xfs_trans_ijoin+0x96/0x100 xfs_trans_ijoin at fs/xfs/libxfs/xfs_trans_inode.c:34 [ 190.961530][ T369] xfs_setattr_nonsize+0x1a6/0xcd0
OK, this patch has royally screwed something up if this path thinks it can enter memory reclaim. This path is inside a transaction, so it is running under PF_MEMALLOC_NOFS context, so should *never* enter memory reclaim.
I'd suggest that whatever mods were made to fs_reclaim_acquire by this patch broke it's basic functionality....
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 13cc653122b7..7536faaaa0fd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -57,6 +57,7 @@
 #include <trace/events/oom.h>
 #include <linux/prefetch.h>
 #include <linux/mm_inline.h>
+#include <linux/mmu_notifier.h>
 #include <linux/migrate.h>
 #include <linux/hugetlb.h>
 #include <linux/sched/rt.h>
@@ -4124,7 +4125,7 @@ should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_fla
 static struct lockdep_map __fs_reclaim_map =
 	STATIC_LOCKDEP_MAP_INIT("fs_reclaim", &__fs_reclaim_map);
 
-static bool __need_fs_reclaim(gfp_t gfp_mask)
+static bool __need_reclaim(gfp_t gfp_mask)
 {
 	gfp_mask = current_gfp_context(gfp_mask);
This is applies the per-task memory allocation context flags to the mask that is checked here.
@@ -4136,10 +4137,6 @@ static bool __need_fs_reclaim(gfp_t gfp_mask)
 	if (current->flags & PF_MEMALLOC)
 		return false;
 
-	/* We're only interested __GFP_FS allocations for now */
-	if (!(gfp_mask & __GFP_FS))
-		return false;
-
 	if (gfp_mask & __GFP_NOLOCKDEP)
 		return false;
@@ -4158,15 +4155,25 @@ void __fs_reclaim_release(void)
 
 void fs_reclaim_acquire(gfp_t gfp_mask)
 {
-	if (__need_fs_reclaim(gfp_mask))
-		__fs_reclaim_acquire();
+	if (__need_reclaim(gfp_mask)) {
+		if (gfp_mask & __GFP_FS)
+			__fs_reclaim_acquire();
.... and they have not been applied in this path. There's your breakage.
For future reference, please post anything that changes NOFS allocation contexts or behaviours to linux-fsdevel, as filesystem developers need to know about proposed changes to infrastructure that is critical to the correct functioning of filesystems...
Uh crap I totally missed that. Apologies for wasting everyone's time here.
Andrew, please drop this one for now, I'll respin it. -Daniel
On Thu, Jun 04, 2020 at 10:12:07AM +0200, Daniel Vetter wrote:
fs_reclaim_acquire/release nicely catch recursion issues when allocating GFP_KERNEL memory against shrinkers (which gpu drivers tend to use to keep the excessive caches in check). For mmu notifier recursions we do have lockdep annotations since 23b68395c7c7 ("mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end").
But these only fire if a path actually results in some pte invalidation - for most small allocations that's very rarely the case. The other trouble is that pte invalidation can happen any time when __GFP_RECLAIM is set. Which means only really GFP_ATOMIC is a safe choice, GFP_NOIO isn't good enough to avoid potential mmu notifier recursion.
I was pondering whether we should just do the general annotation, but there's always the risk for false positives. Plus I'm assuming that the core fs and io code is a lot better reviewed and tested than random mmu notifier code in drivers. Hence why I decide to only annotate for that specific case.
Furthermore even if we'd create a lockdep map for direct reclaim, we'd still need to explicit pull in the mmu notifier map - there's a lot more places that do pte invalidation than just direct reclaim, these two contexts arent the same.
Note that the mmu notifiers needing their own independent lockdep map is also the reason we can't hold them from fs_reclaim_acquire to fs_reclaim_release - it would nest with the acquistion in the pte invalidation code, causing a lockdep splat. And we can't remove the annotations from pte invalidation and all the other places since they're called from many other places than page reclaim. Hence we can only do the equivalent of might_lock, but on the raw lockdep map.
With this we can also remove the lockdep priming added in 66204f1d2d1b ("mm/mmu_notifiers: prime lockdep") since the new annotations are strictly more powerful.
Cc: Andrew Morton akpm@linux-foundation.org Cc: Jason Gunthorpe jgg@mellanox.com Cc: linux-mm@kvack.org Cc: linux-rdma@vger.kernel.org Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com
Reverting this commit fixed the lockdep splat below while applying some memory pressure,
[ 190.455003][ T369] WARNING: possible circular locking dependency detected [ 190.487291][ T369] 5.8.0-rc1-next-20200621 #1 Not tainted [ 190.512363][ T369] ------------------------------------------------------ [ 190.543354][ T369] kswapd3/369 is trying to acquire lock: [ 190.568523][ T369] ffff889fcf694528 (&xfs_nondir_ilock_class){++++}-{3:3}, at: xfs_reclaim_inode+0xdf/0x860 spin_lock at include/linux/spinlock.h:353 (inlined by) xfs_iflags_test_and_set at fs/xfs/xfs_inode.h:166 (inlined by) xfs_iflock_nowait at fs/xfs/xfs_inode.h:249 (inlined by) xfs_reclaim_inode at fs/xfs/xfs_icache.c:1127 [ 190.614359][ T369] [ 190.614359][ T369] but task is already holding lock: [ 190.647763][ T369] ffffffffb50ced00 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x0/0x30 __fs_reclaim_acquire at mm/page_alloc.c:4200 [ 190.687845][ T369] [ 190.687845][ T369] which lock already depends on the new lock. [ 190.687845][ T369] [ 190.734890][ T369] [ 190.734890][ T369] the existing dependency chain (in reverse order) is: [ 190.775991][ T369] [ 190.775991][ T369] -> #1 (fs_reclaim){+.+.}-{0:0}: [ 190.808150][ T369] fs_reclaim_acquire+0x77/0x80 [ 190.832152][ T369] slab_pre_alloc_hook.constprop.52+0x20/0x120 slab_pre_alloc_hook at mm/slab.h:507 [ 190.862173][ T369] kmem_cache_alloc+0x43/0x2a0 [ 190.885602][ T369] kmem_zone_alloc+0x113/0x3ef kmem_zone_alloc at fs/xfs/kmem.c:129 [ 190.908702][ T369] xfs_inode_item_init+0x1d/0xa0 xfs_inode_item_init at fs/xfs/xfs_inode_item.c:639 [ 190.934461][ T369] xfs_trans_ijoin+0x96/0x100 xfs_trans_ijoin at fs/xfs/libxfs/xfs_trans_inode.c:34 [ 190.961530][ T369] xfs_setattr_nonsize+0x1a6/0xcd0 xfs_setattr_nonsize at fs/xfs/xfs_iops.c:716 [ 190.987331][ T369] xfs_vn_setattr+0x133/0x160 xfs_vn_setattr at fs/xfs/xfs_iops.c:1081 [ 191.010476][ T369] notify_change+0x6c5/0xba1 notify_change at fs/attr.c:336 [ 191.033317][ T369] chmod_common+0x19b/0x390 [ 191.055770][ T369] ksys_fchmod+0x28/0x60 [ 191.077957][ T369] __x64_sys_fchmod+0x4e/0x70 [ 191.102767][ T369] do_syscall_64+0x5f/0x310 [ 191.125090][ T369] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 191.153749][ T369] [ 191.153749][ T369] -> #0 (&xfs_nondir_ilock_class){++++}-{3:3}: [ 191.191267][ T369] __lock_acquire+0x2efc/0x4da0 [ 191.215974][ T369] lock_acquire+0x1ac/0xaf0 [ 191.238953][ T369] down_write_nested+0x92/0x150 [ 191.262955][ T369] xfs_reclaim_inode+0xdf/0x860 [ 191.287149][ T369] xfs_reclaim_inodes_ag+0x505/0xb00 [ 191.313291][ T369] xfs_reclaim_inodes_nr+0x93/0xd0 [ 191.338357][ T369] super_cache_scan+0x2fd/0x430 [ 191.362354][ T369] do_shrink_slab+0x317/0x990 [ 191.385341][ T369] shrink_slab+0x3a8/0x4b0 [ 191.407214][ T369] shrink_node+0x49c/0x17b0 [ 191.429841][ T369] balance_pgdat+0x59c/0xed0 [ 191.455041][ T369] kswapd+0x5a4/0xc40 [ 191.477524][ T369] kthread+0x358/0x420 [ 191.499285][ T369] ret_from_fork+0x22/0x30 [ 191.521107][ T369] [ 191.521107][ T369] other info that might help us debug this: [ 191.521107][ T369] [ 191.567490][ T369] Possible unsafe locking scenario: [ 191.567490][ T369] [ 191.600947][ T369] CPU0 CPU1 [ 191.624808][ T369] ---- ---- [ 191.649236][ T369] lock(fs_reclaim); [ 191.667607][ T369] lock(&xfs_nondir_ilock_class); [ 191.702096][ T369] lock(fs_reclaim); [ 191.731243][ T369] lock(&xfs_nondir_ilock_class); [ 191.754025][ T369] [ 191.754025][ T369] *** DEADLOCK *** [ 191.754025][ T369] [ 191.791126][ T369] 4 locks held by kswapd3/369: [ 191.812198][ T369] #0: ffffffffb50ced00 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x0/0x30 [ 191.854319][ T369] #1: ffffffffb5074c50 
(shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x219/0x4b0 [ 191.896043][ T369] #2: ffff8890279b40e0 (&type->s_umount_key#27){++++}-{3:3}, at: trylock_super+0x11/0xb0 [ 191.940538][ T369] #3: ffff889027a73a28 (&pag->pag_ici_reclaim_lock){+.+.}-{3:3}, at: xfs_reclaim_inodes_ag+0x135/0xb00 [ 191.995314][ T369] [ 191.995314][ T369] stack backtrace: [ 192.022934][ T369] CPU: 42 PID: 369 Comm: kswapd3 Not tainted 5.8.0-rc1-next-20200621 #1 [ 192.060546][ T369] Hardware name: HP ProLiant BL660c Gen9, BIOS I38 10/17/2018 [ 192.094518][ T369] Call Trace: [ 192.109005][ T369] dump_stack+0x9d/0xe0 [ 192.127468][ T369] check_noncircular+0x347/0x400 [ 192.149526][ T369] ? print_circular_bug+0x360/0x360 [ 192.172584][ T369] ? freezing_slow_path.cold.2+0x2a/0x2a [ 192.197251][ T369] __lock_acquire+0x2efc/0x4da0 [ 192.218737][ T369] ? lockdep_hardirqs_on_prepare+0x550/0x550 [ 192.246736][ T369] ? __lock_acquire+0x3541/0x4da0 [ 192.269673][ T369] lock_acquire+0x1ac/0xaf0 [ 192.290192][ T369] ? xfs_reclaim_inode+0xdf/0x860 [ 192.313158][ T369] ? rcu_read_unlock+0x50/0x50 [ 192.335057][ T369] down_write_nested+0x92/0x150 [ 192.358409][ T369] ? xfs_reclaim_inode+0xdf/0x860 [ 192.380890][ T369] ? rwsem_down_write_slowpath+0xf50/0xf50 [ 192.406891][ T369] ? find_held_lock+0x33/0x1c0 [ 192.427925][ T369] ? xfs_ilock+0x2ef/0x370 [ 192.447496][ T369] ? xfs_reclaim_inode+0xdf/0x860 [ 192.472315][ T369] xfs_reclaim_inode+0xdf/0x860 [ 192.496649][ T369] ? xfs_inode_clear_reclaim_tag+0xa0/0xa0 [ 192.524188][ T369] ? do_raw_spin_unlock+0x4f/0x250 [ 192.546852][ T369] xfs_reclaim_inodes_ag+0x505/0xb00 [ 192.570473][ T369] ? xfs_reclaim_inode+0x860/0x860 [ 192.592692][ T369] ? mark_held_locks+0xb0/0x110 [ 192.614287][ T369] ? lockdep_hardirqs_on_prepare+0x38c/0x550 [ 192.640800][ T369] ? _raw_spin_unlock_irqrestore+0x39/0x40 [ 192.666695][ T369] ? try_to_wake_up+0xcf/0xf40 [ 192.688265][ T369] ? migrate_swap_stop+0xc10/0xc10 [ 192.711966][ T369] ? do_raw_spin_unlock+0x4f/0x250 [ 192.735032][ T369] xfs_reclaim_inodes_nr+0x93/0xd0 xfs_reclaim_inodes_nr at fs/xfs/xfs_icache.c:1399 [ 192.757674][ T369] ? xfs_reclaim_inodes+0x90/0x90 [ 192.780028][ T369] ? list_lru_count_one+0x177/0x300 [ 192.803010][ T369] super_cache_scan+0x2fd/0x430 super_cache_scan at fs/super.c:115 [ 192.824491][ T369] do_shrink_slab+0x317/0x990 do_shrink_slab at mm/vmscan.c:514 [ 192.845160][ T369] shrink_slab+0x3a8/0x4b0 shrink_slab_memcg at mm/vmscan.c:584 (inlined by) shrink_slab at mm/vmscan.c:662 [ 192.864722][ T369] ? do_shrink_slab+0x990/0x990 [ 192.886137][ T369] ? rcu_is_watching+0x2c/0x80 [ 192.907289][ T369] ? mem_cgroup_protected+0x228/0x470 [ 192.931166][ T369] ? vmpressure+0x25/0x290 [ 192.950595][ T369] shrink_node+0x49c/0x17b0 [ 192.972332][ T369] balance_pgdat+0x59c/0xed0 kswapd_shrink_node at mm/vmscan.c:3521 (inlined by) balance_pgdat at mm/vmscan.c:3670 [ 192.994918][ T369] ? __node_reclaim+0x950/0x950 [ 193.018625][ T369] ? lockdep_hardirqs_on_prepare+0x38c/0x550 [ 193.046566][ T369] ? _raw_spin_unlock_irq+0x1f/0x30 [ 193.070214][ T369] ? _raw_spin_unlock_irq+0x1f/0x30 [ 193.093176][ T369] ? finish_task_switch+0x129/0x650 [ 193.116225][ T369] ? finish_task_switch+0xf2/0x650 [ 193.138809][ T369] ? rcu_read_lock_bh_held+0xc0/0xc0 [ 193.163323][ T369] kswapd+0x5a4/0xc40 [ 193.182690][ T369] ? __kthread_parkme+0x4d/0x1a0 [ 193.204660][ T369] ? balance_pgdat+0xed0/0xed0 [ 193.225776][ T369] ? _raw_spin_unlock_irqrestore+0x39/0x40 [ 193.252306][ T369] ? finish_wait+0x270/0x270 [ 193.272473][ T369] ? 
__kthread_parkme+0x4d/0x1a0 [ 193.294476][ T369] ? __kthread_parkme+0xcc/0x1a0 [ 193.316704][ T369] ? balance_pgdat+0xed0/0xed0 [ 193.337808][ T369] kthread+0x358/0x420 [ 193.355666][ T369] ? kthread_create_worker_on_cpu+0xc0/0xc0 [ 193.381884][ T369] ret_from_fork+0x22/0x30
This is part of a gpu lockdep annotation series simply because it really helps to catch issues where gpu subsystem locks and primitives can deadlock with themselves through allocations and mmu notifiers. But aside from that motivation it should be completely free-standing, and can land through -mm/-rdma/-hmm or any other tree really whenever.
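The kind of driver bug this is aimed at looks roughly like the following; hypothetical driver code, purely illustrative (example_driver_lock, example_invalidate and example_ioctl are made-up names):

#include <linux/mmu_notifier.h>
#include <linux/mutex.h>
#include <linux/slab.h>

static DEFINE_MUTEX(example_driver_lock);

/* Runs with the invalidate_range_start lockdep map held, so lockdep
 * records invalidate_range_start -> example_driver_lock. */
static int example_invalidate(struct mmu_notifier *mn,
			      const struct mmu_notifier_range *range)
{
	mutex_lock(&example_driver_lock);
	/* ... evict whatever backs this range ... */
	mutex_unlock(&example_driver_lock);
	return 0;
}

/* With this patch any reclaim-capable allocation under the same lock also
 * records example_driver_lock -> invalidate_range_start via
 * fs_reclaim_acquire(), even if reclaim never invalidates a single pte,
 * so lockdep can report the potential deadlock immediately. */
static void example_ioctl(void)
{
	mutex_lock(&example_driver_lock);
	kfree(kmalloc(64, GFP_KERNEL));
	mutex_unlock(&example_driver_lock);
}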
-Daniel
 mm/mmu_notifier.c |  7 -------
 mm/page_alloc.c   | 23 ++++++++++++++---------
 2 files changed, 14 insertions(+), 16 deletions(-)

diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 06852b896fa6..5d578b9122f8 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -612,13 +612,6 @@ int __mmu_notifier_register(struct mmu_notifier *subscription,
 	lockdep_assert_held_write(&mm->mmap_sem);
 	BUG_ON(atomic_read(&mm->mm_users) <= 0);
 
-	if (IS_ENABLED(CONFIG_LOCKDEP)) {
-		fs_reclaim_acquire(GFP_KERNEL);
-		lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
-		lock_map_release(&__mmu_notifier_invalidate_range_start_map);
-		fs_reclaim_release(GFP_KERNEL);
-	}
-
 	if (!mm->notifier_subscriptions) {
 		/*
 		 * kmalloc cannot be called under mm_take_all_locks(), but we
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 13cc653122b7..f8a222db4a53 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -57,6 +57,7 @@
 #include <trace/events/oom.h>
 #include <linux/prefetch.h>
 #include <linux/mm_inline.h>
+#include <linux/mmu_notifier.h>
 #include <linux/migrate.h>
 #include <linux/hugetlb.h>
 #include <linux/sched/rt.h>
@@ -4124,7 +4125,7 @@ should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_fla
 static struct lockdep_map __fs_reclaim_map =
 	STATIC_LOCKDEP_MAP_INIT("fs_reclaim", &__fs_reclaim_map);
 
-static bool __need_fs_reclaim(gfp_t gfp_mask)
+static bool __need_reclaim(gfp_t gfp_mask)
 {
 	gfp_mask = current_gfp_context(gfp_mask);
 
@@ -4136,10 +4137,6 @@ static bool __need_fs_reclaim(gfp_t gfp_mask)
 	if (current->flags & PF_MEMALLOC)
 		return false;
 
-	/* We're only interested __GFP_FS allocations for now */
-	if (!(gfp_mask & __GFP_FS))
-		return false;
-
 	if (gfp_mask & __GFP_NOLOCKDEP)
 		return false;
 
@@ -4158,15 +4155,23 @@ void __fs_reclaim_release(void)
 
 void fs_reclaim_acquire(gfp_t gfp_mask)
 {
-	if (__need_fs_reclaim(gfp_mask))
-		__fs_reclaim_acquire();
+	if (__need_reclaim(gfp_mask)) {
+		if (!(gfp_mask & __GFP_FS))
+			__fs_reclaim_acquire();
+
+		lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
+		lock_map_release(&__mmu_notifier_invalidate_range_start_map);
+	}
 }
 EXPORT_SYMBOL_GPL(fs_reclaim_acquire);
 
 void fs_reclaim_release(gfp_t gfp_mask)
 {
-	if (__need_fs_reclaim(gfp_mask))
-		__fs_reclaim_release();
+	if (__need_reclaim(gfp_mask)) {
+		if (!(gfp_mask & __GFP_FS))
+			__fs_reclaim_release();
+	}
 }
 EXPORT_SYMBOL_GPL(fs_reclaim_release);
 #endif
2.26.2
On Sun, Jun 21, 2020 at 7:01 PM Qian Cai cai@lca.pw wrote:
On Thu, Jun 04, 2020 at 10:12:07AM +0200, Daniel Vetter wrote:
fs_reclaim_acquire/release nicely catch recursion issues when allocating GFP_KERNEL memory against shrinkers (which gpu drivers tend to use to keep the excessive caches in check). For mmu notifier recursions we do have lockdep annotations since 23b68395c7c7 ("mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end").
But these only fire if a path actually results in some pte invalidation - for most small allocations that's very rarely the case. The other trouble is that pte invalidation can happen any time when __GFP_RECLAIM is set. Which means only really GFP_ATOMIC is a safe choice, GFP_NOIO isn't good enough to avoid potential mmu notifier recursion.
I was pondering whether we should just do the general annotation, but there's always the risk for false positives. Plus I'm assuming that the core fs and io code is a lot better reviewed and tested than random mmu notifier code in drivers. Hence why I decide to only annotate for that specific case.
Furthermore even if we'd create a lockdep map for direct reclaim, we'd still need to explicit pull in the mmu notifier map - there's a lot more places that do pte invalidation than just direct reclaim, these two contexts arent the same.
Note that the mmu notifiers needing their own independent lockdep map is also the reason we can't hold them from fs_reclaim_acquire to fs_reclaim_release - it would nest with the acquistion in the pte invalidation code, causing a lockdep splat. And we can't remove the annotations from pte invalidation and all the other places since they're called from many other places than page reclaim. Hence we can only do the equivalent of might_lock, but on the raw lockdep map.
With this we can also remove the lockdep priming added in 66204f1d2d1b ("mm/mmu_notifiers: prime lockdep") since the new annotations are strictly more powerful.
Cc: Andrew Morton akpm@linux-foundation.org Cc: Jason Gunthorpe jgg@mellanox.com Cc: linux-mm@kvack.org Cc: linux-rdma@vger.kernel.org Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com
Reverting this commit fixed the lockdep splat below while applying some memory pressure,
This is a broken version of the patch, please use the one Andrew merged into -mm.
Thanks, Daniel
[ 190.455003][ T369] WARNING: possible circular locking dependency detected [ 190.487291][ T369] 5.8.0-rc1-next-20200621 #1 Not tainted [ 190.512363][ T369] ------------------------------------------------------ [ 190.543354][ T369] kswapd3/369 is trying to acquire lock: [ 190.568523][ T369] ffff889fcf694528 (&xfs_nondir_ilock_class){++++}-{3:3}, at: xfs_reclaim_inode+0xdf/0x860 spin_lock at include/linux/spinlock.h:353 (inlined by) xfs_iflags_test_and_set at fs/xfs/xfs_inode.h:166 (inlined by) xfs_iflock_nowait at fs/xfs/xfs_inode.h:249 (inlined by) xfs_reclaim_inode at fs/xfs/xfs_icache.c:1127 [ 190.614359][ T369] [ 190.614359][ T369] but task is already holding lock: [ 190.647763][ T369] ffffffffb50ced00 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x0/0x30 __fs_reclaim_acquire at mm/page_alloc.c:4200 [ 190.687845][ T369] [ 190.687845][ T369] which lock already depends on the new lock. [ 190.687845][ T369] [ 190.734890][ T369] [ 190.734890][ T369] the existing dependency chain (in reverse order) is: [ 190.775991][ T369] [ 190.775991][ T369] -> #1 (fs_reclaim){+.+.}-{0:0}: [ 190.808150][ T369] fs_reclaim_acquire+0x77/0x80 [ 190.832152][ T369] slab_pre_alloc_hook.constprop.52+0x20/0x120 slab_pre_alloc_hook at mm/slab.h:507 [ 190.862173][ T369] kmem_cache_alloc+0x43/0x2a0 [ 190.885602][ T369] kmem_zone_alloc+0x113/0x3ef kmem_zone_alloc at fs/xfs/kmem.c:129 [ 190.908702][ T369] xfs_inode_item_init+0x1d/0xa0 xfs_inode_item_init at fs/xfs/xfs_inode_item.c:639 [ 190.934461][ T369] xfs_trans_ijoin+0x96/0x100 xfs_trans_ijoin at fs/xfs/libxfs/xfs_trans_inode.c:34 [ 190.961530][ T369] xfs_setattr_nonsize+0x1a6/0xcd0 xfs_setattr_nonsize at fs/xfs/xfs_iops.c:716 [ 190.987331][ T369] xfs_vn_setattr+0x133/0x160 xfs_vn_setattr at fs/xfs/xfs_iops.c:1081 [ 191.010476][ T369] notify_change+0x6c5/0xba1 notify_change at fs/attr.c:336 [ 191.033317][ T369] chmod_common+0x19b/0x390 [ 191.055770][ T369] ksys_fchmod+0x28/0x60 [ 191.077957][ T369] __x64_sys_fchmod+0x4e/0x70 [ 191.102767][ T369] do_syscall_64+0x5f/0x310 [ 191.125090][ T369] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 191.153749][ T369] [ 191.153749][ T369] -> #0 (&xfs_nondir_ilock_class){++++}-{3:3}: [ 191.191267][ T369] __lock_acquire+0x2efc/0x4da0 [ 191.215974][ T369] lock_acquire+0x1ac/0xaf0 [ 191.238953][ T369] down_write_nested+0x92/0x150 [ 191.262955][ T369] xfs_reclaim_inode+0xdf/0x860 [ 191.287149][ T369] xfs_reclaim_inodes_ag+0x505/0xb00 [ 191.313291][ T369] xfs_reclaim_inodes_nr+0x93/0xd0 [ 191.338357][ T369] super_cache_scan+0x2fd/0x430 [ 191.362354][ T369] do_shrink_slab+0x317/0x990 [ 191.385341][ T369] shrink_slab+0x3a8/0x4b0 [ 191.407214][ T369] shrink_node+0x49c/0x17b0 [ 191.429841][ T369] balance_pgdat+0x59c/0xed0 [ 191.455041][ T369] kswapd+0x5a4/0xc40 [ 191.477524][ T369] kthread+0x358/0x420 [ 191.499285][ T369] ret_from_fork+0x22/0x30 [ 191.521107][ T369] [ 191.521107][ T369] other info that might help us debug this: [ 191.521107][ T369] [ 191.567490][ T369] Possible unsafe locking scenario: [ 191.567490][ T369] [ 191.600947][ T369] CPU0 CPU1 [ 191.624808][ T369] ---- ---- [ 191.649236][ T369] lock(fs_reclaim); [ 191.667607][ T369] lock(&xfs_nondir_ilock_class); [ 191.702096][ T369] lock(fs_reclaim); [ 191.731243][ T369] lock(&xfs_nondir_ilock_class); [ 191.754025][ T369] [ 191.754025][ T369] *** DEADLOCK *** [ 191.754025][ T369] [ 191.791126][ T369] 4 locks held by kswapd3/369: [ 191.812198][ T369] #0: ffffffffb50ced00 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x0/0x30 [ 191.854319][ T369] #1: ffffffffb5074c50 
(shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x219/0x4b0 [ 191.896043][ T369] #2: ffff8890279b40e0 (&type->s_umount_key#27){++++}-{3:3}, at: trylock_super+0x11/0xb0 [ 191.940538][ T369] #3: ffff889027a73a28 (&pag->pag_ici_reclaim_lock){+.+.}-{3:3}, at: xfs_reclaim_inodes_ag+0x135/0xb00 [ 191.995314][ T369] [ 191.995314][ T369] stack backtrace: [ 192.022934][ T369] CPU: 42 PID: 369 Comm: kswapd3 Not tainted 5.8.0-rc1-next-20200621 #1 [ 192.060546][ T369] Hardware name: HP ProLiant BL660c Gen9, BIOS I38 10/17/2018 [ 192.094518][ T369] Call Trace: [ 192.109005][ T369] dump_stack+0x9d/0xe0 [ 192.127468][ T369] check_noncircular+0x347/0x400 [ 192.149526][ T369] ? print_circular_bug+0x360/0x360 [ 192.172584][ T369] ? freezing_slow_path.cold.2+0x2a/0x2a [ 192.197251][ T369] __lock_acquire+0x2efc/0x4da0 [ 192.218737][ T369] ? lockdep_hardirqs_on_prepare+0x550/0x550 [ 192.246736][ T369] ? __lock_acquire+0x3541/0x4da0 [ 192.269673][ T369] lock_acquire+0x1ac/0xaf0 [ 192.290192][ T369] ? xfs_reclaim_inode+0xdf/0x860 [ 192.313158][ T369] ? rcu_read_unlock+0x50/0x50 [ 192.335057][ T369] down_write_nested+0x92/0x150 [ 192.358409][ T369] ? xfs_reclaim_inode+0xdf/0x860 [ 192.380890][ T369] ? rwsem_down_write_slowpath+0xf50/0xf50 [ 192.406891][ T369] ? find_held_lock+0x33/0x1c0 [ 192.427925][ T369] ? xfs_ilock+0x2ef/0x370 [ 192.447496][ T369] ? xfs_reclaim_inode+0xdf/0x860 [ 192.472315][ T369] xfs_reclaim_inode+0xdf/0x860 [ 192.496649][ T369] ? xfs_inode_clear_reclaim_tag+0xa0/0xa0 [ 192.524188][ T369] ? do_raw_spin_unlock+0x4f/0x250 [ 192.546852][ T369] xfs_reclaim_inodes_ag+0x505/0xb00 [ 192.570473][ T369] ? xfs_reclaim_inode+0x860/0x860 [ 192.592692][ T369] ? mark_held_locks+0xb0/0x110 [ 192.614287][ T369] ? lockdep_hardirqs_on_prepare+0x38c/0x550 [ 192.640800][ T369] ? _raw_spin_unlock_irqrestore+0x39/0x40 [ 192.666695][ T369] ? try_to_wake_up+0xcf/0xf40 [ 192.688265][ T369] ? migrate_swap_stop+0xc10/0xc10 [ 192.711966][ T369] ? do_raw_spin_unlock+0x4f/0x250 [ 192.735032][ T369] xfs_reclaim_inodes_nr+0x93/0xd0 xfs_reclaim_inodes_nr at fs/xfs/xfs_icache.c:1399 [ 192.757674][ T369] ? xfs_reclaim_inodes+0x90/0x90 [ 192.780028][ T369] ? list_lru_count_one+0x177/0x300 [ 192.803010][ T369] super_cache_scan+0x2fd/0x430 super_cache_scan at fs/super.c:115 [ 192.824491][ T369] do_shrink_slab+0x317/0x990 do_shrink_slab at mm/vmscan.c:514 [ 192.845160][ T369] shrink_slab+0x3a8/0x4b0 shrink_slab_memcg at mm/vmscan.c:584 (inlined by) shrink_slab at mm/vmscan.c:662 [ 192.864722][ T369] ? do_shrink_slab+0x990/0x990 [ 192.886137][ T369] ? rcu_is_watching+0x2c/0x80 [ 192.907289][ T369] ? mem_cgroup_protected+0x228/0x470 [ 192.931166][ T369] ? vmpressure+0x25/0x290 [ 192.950595][ T369] shrink_node+0x49c/0x17b0 [ 192.972332][ T369] balance_pgdat+0x59c/0xed0 kswapd_shrink_node at mm/vmscan.c:3521 (inlined by) balance_pgdat at mm/vmscan.c:3670 [ 192.994918][ T369] ? __node_reclaim+0x950/0x950 [ 193.018625][ T369] ? lockdep_hardirqs_on_prepare+0x38c/0x550 [ 193.046566][ T369] ? _raw_spin_unlock_irq+0x1f/0x30 [ 193.070214][ T369] ? _raw_spin_unlock_irq+0x1f/0x30 [ 193.093176][ T369] ? finish_task_switch+0x129/0x650 [ 193.116225][ T369] ? finish_task_switch+0xf2/0x650 [ 193.138809][ T369] ? rcu_read_lock_bh_held+0xc0/0xc0 [ 193.163323][ T369] kswapd+0x5a4/0xc40 [ 193.182690][ T369] ? __kthread_parkme+0x4d/0x1a0 [ 193.204660][ T369] ? balance_pgdat+0xed0/0xed0 [ 193.225776][ T369] ? _raw_spin_unlock_irqrestore+0x39/0x40 [ 193.252306][ T369] ? finish_wait+0x270/0x270 [ 193.272473][ T369] ? 
__kthread_parkme+0x4d/0x1a0 [ 193.294476][ T369] ? __kthread_parkme+0xcc/0x1a0 [ 193.316704][ T369] ? balance_pgdat+0xed0/0xed0 [ 193.337808][ T369] kthread+0x358/0x420 [ 193.355666][ T369] ? kthread_create_worker_on_cpu+0xc0/0xc0 [ 193.381884][ T369] ret_from_fork+0x22/0x30
This is part of a gpu lockdep annotation series simply because it really helps to catch issues where gpu subsystem locks and primitives can deadlock with themselves through allocations and mmu notifiers. But aside from that motivation it should be completely free-standing, and can land through -mm/-rdma/-hmm or any other tree really whenever.
-Daniel
 mm/mmu_notifier.c |  7 -------
 mm/page_alloc.c   | 23 ++++++++++++++---------
 2 files changed, 14 insertions(+), 16 deletions(-)

diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 06852b896fa6..5d578b9122f8 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -612,13 +612,6 @@ int __mmu_notifier_register(struct mmu_notifier *subscription,
 	lockdep_assert_held_write(&mm->mmap_sem);
 	BUG_ON(atomic_read(&mm->mm_users) <= 0);
 
-	if (IS_ENABLED(CONFIG_LOCKDEP)) {
-		fs_reclaim_acquire(GFP_KERNEL);
-		lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
-		lock_map_release(&__mmu_notifier_invalidate_range_start_map);
-		fs_reclaim_release(GFP_KERNEL);
-	}
-
 	if (!mm->notifier_subscriptions) {
 		/*
 		 * kmalloc cannot be called under mm_take_all_locks(), but we
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 13cc653122b7..f8a222db4a53 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -57,6 +57,7 @@
 #include <trace/events/oom.h>
 #include <linux/prefetch.h>
 #include <linux/mm_inline.h>
+#include <linux/mmu_notifier.h>
 #include <linux/migrate.h>
 #include <linux/hugetlb.h>
 #include <linux/sched/rt.h>
@@ -4124,7 +4125,7 @@ should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_fla
 static struct lockdep_map __fs_reclaim_map =
 	STATIC_LOCKDEP_MAP_INIT("fs_reclaim", &__fs_reclaim_map);
 
-static bool __need_fs_reclaim(gfp_t gfp_mask)
+static bool __need_reclaim(gfp_t gfp_mask)
 {
 	gfp_mask = current_gfp_context(gfp_mask);
 
@@ -4136,10 +4137,6 @@ static bool __need_fs_reclaim(gfp_t gfp_mask)
 	if (current->flags & PF_MEMALLOC)
 		return false;
 
-	/* We're only interested __GFP_FS allocations for now */
-	if (!(gfp_mask & __GFP_FS))
-		return false;
-
 	if (gfp_mask & __GFP_NOLOCKDEP)
 		return false;
 
@@ -4158,15 +4155,23 @@ void __fs_reclaim_release(void)
 
 void fs_reclaim_acquire(gfp_t gfp_mask)
 {
-	if (__need_fs_reclaim(gfp_mask))
-		__fs_reclaim_acquire();
+	if (__need_reclaim(gfp_mask)) {
+		if (!(gfp_mask & __GFP_FS))
+			__fs_reclaim_acquire();
+
+		lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
+		lock_map_release(&__mmu_notifier_invalidate_range_start_map);
+	}
 }
 EXPORT_SYMBOL_GPL(fs_reclaim_acquire);
 
 void fs_reclaim_release(gfp_t gfp_mask)
 {
-	if (__need_fs_reclaim(gfp_mask))
-		__fs_reclaim_release();
+	if (__need_reclaim(gfp_mask)) {
+		if (!(gfp_mask & __GFP_FS))
+			__fs_reclaim_release();
+	}
 }
 EXPORT_SYMBOL_GPL(fs_reclaim_release);
 #endif
2.26.2
On Sun, Jun 21, 2020 at 07:28:40PM +0200, Daniel Vetter wrote:
On Sun, Jun 21, 2020 at 7:01 PM Qian Cai cai@lca.pw wrote:
On Thu, Jun 04, 2020 at 10:12:07AM +0200, Daniel Vetter wrote:
fs_reclaim_acquire/release nicely catch recursion issues when allocating GFP_KERNEL memory against shrinkers (which gpu drivers tend to use to keep the excessive caches in check). For mmu notifier recursions we do have lockdep annotations since 23b68395c7c7 ("mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end").
But these only fire if a path actually results in some pte invalidation - for most small allocations that's very rarely the case. The other trouble is that pte invalidation can happen any time when __GFP_RECLAIM is set. Which means only really GFP_ATOMIC is a safe choice, GFP_NOIO isn't good enough to avoid potential mmu notifier recursion.
I was pondering whether we should just do the general annotation, but there's always the risk for false positives. Plus I'm assuming that the core fs and io code is a lot better reviewed and tested than random mmu notifier code in drivers. Hence why I decide to only annotate for that specific case.
Furthermore even if we'd create a lockdep map for direct reclaim, we'd still need to explicit pull in the mmu notifier map - there's a lot more places that do pte invalidation than just direct reclaim, these two contexts arent the same.
Note that the mmu notifiers needing their own independent lockdep map is also the reason we can't hold them from fs_reclaim_acquire to fs_reclaim_release - it would nest with the acquistion in the pte invalidation code, causing a lockdep splat. And we can't remove the annotations from pte invalidation and all the other places since they're called from many other places than page reclaim. Hence we can only do the equivalent of might_lock, but on the raw lockdep map.
With this we can also remove the lockdep priming added in 66204f1d2d1b ("mm/mmu_notifiers: prime lockdep") since the new annotations are strictly more powerful.
Cc: Andrew Morton akpm@linux-foundation.org Cc: Jason Gunthorpe jgg@mellanox.com Cc: linux-mm@kvack.org Cc: linux-rdma@vger.kernel.org Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com
Reverting this commit fixed the lockdep splat below while applying some memory pressure,
This is a broken version of the patch, please use the one Andrew merged into -mm.
Yes, since it is 5.8.0-rc1-next-20200621, which I believe includes the latest version from -mm. Anyway, I replied again to your latest patch,
https://lore.kernel.org/lkml/20200621174205.GB1398@lca.pw/
Thanks, Daniel
[ 190.455003][ T369] WARNING: possible circular locking dependency detected [ 190.487291][ T369] 5.8.0-rc1-next-20200621 #1 Not tainted [ 190.512363][ T369] ------------------------------------------------------ [ 190.543354][ T369] kswapd3/369 is trying to acquire lock: [ 190.568523][ T369] ffff889fcf694528 (&xfs_nondir_ilock_class){++++}-{3:3}, at: xfs_reclaim_inode+0xdf/0x860 spin_lock at include/linux/spinlock.h:353 (inlined by) xfs_iflags_test_and_set at fs/xfs/xfs_inode.h:166 (inlined by) xfs_iflock_nowait at fs/xfs/xfs_inode.h:249 (inlined by) xfs_reclaim_inode at fs/xfs/xfs_icache.c:1127 [ 190.614359][ T369] [ 190.614359][ T369] but task is already holding lock: [ 190.647763][ T369] ffffffffb50ced00 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x0/0x30 __fs_reclaim_acquire at mm/page_alloc.c:4200 [ 190.687845][ T369] [ 190.687845][ T369] which lock already depends on the new lock. [ 190.687845][ T369] [ 190.734890][ T369] [ 190.734890][ T369] the existing dependency chain (in reverse order) is: [ 190.775991][ T369] [ 190.775991][ T369] -> #1 (fs_reclaim){+.+.}-{0:0}: [ 190.808150][ T369] fs_reclaim_acquire+0x77/0x80 [ 190.832152][ T369] slab_pre_alloc_hook.constprop.52+0x20/0x120 slab_pre_alloc_hook at mm/slab.h:507 [ 190.862173][ T369] kmem_cache_alloc+0x43/0x2a0 [ 190.885602][ T369] kmem_zone_alloc+0x113/0x3ef kmem_zone_alloc at fs/xfs/kmem.c:129 [ 190.908702][ T369] xfs_inode_item_init+0x1d/0xa0 xfs_inode_item_init at fs/xfs/xfs_inode_item.c:639 [ 190.934461][ T369] xfs_trans_ijoin+0x96/0x100 xfs_trans_ijoin at fs/xfs/libxfs/xfs_trans_inode.c:34 [ 190.961530][ T369] xfs_setattr_nonsize+0x1a6/0xcd0 xfs_setattr_nonsize at fs/xfs/xfs_iops.c:716 [ 190.987331][ T369] xfs_vn_setattr+0x133/0x160 xfs_vn_setattr at fs/xfs/xfs_iops.c:1081 [ 191.010476][ T369] notify_change+0x6c5/0xba1 notify_change at fs/attr.c:336 [ 191.033317][ T369] chmod_common+0x19b/0x390 [ 191.055770][ T369] ksys_fchmod+0x28/0x60 [ 191.077957][ T369] __x64_sys_fchmod+0x4e/0x70 [ 191.102767][ T369] do_syscall_64+0x5f/0x310 [ 191.125090][ T369] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 191.153749][ T369] [ 191.153749][ T369] -> #0 (&xfs_nondir_ilock_class){++++}-{3:3}: [ 191.191267][ T369] __lock_acquire+0x2efc/0x4da0 [ 191.215974][ T369] lock_acquire+0x1ac/0xaf0 [ 191.238953][ T369] down_write_nested+0x92/0x150 [ 191.262955][ T369] xfs_reclaim_inode+0xdf/0x860 [ 191.287149][ T369] xfs_reclaim_inodes_ag+0x505/0xb00 [ 191.313291][ T369] xfs_reclaim_inodes_nr+0x93/0xd0 [ 191.338357][ T369] super_cache_scan+0x2fd/0x430 [ 191.362354][ T369] do_shrink_slab+0x317/0x990 [ 191.385341][ T369] shrink_slab+0x3a8/0x4b0 [ 191.407214][ T369] shrink_node+0x49c/0x17b0 [ 191.429841][ T369] balance_pgdat+0x59c/0xed0 [ 191.455041][ T369] kswapd+0x5a4/0xc40 [ 191.477524][ T369] kthread+0x358/0x420 [ 191.499285][ T369] ret_from_fork+0x22/0x30 [ 191.521107][ T369] [ 191.521107][ T369] other info that might help us debug this: [ 191.521107][ T369] [ 191.567490][ T369] Possible unsafe locking scenario: [ 191.567490][ T369] [ 191.600947][ T369] CPU0 CPU1 [ 191.624808][ T369] ---- ---- [ 191.649236][ T369] lock(fs_reclaim); [ 191.667607][ T369] lock(&xfs_nondir_ilock_class); [ 191.702096][ T369] lock(fs_reclaim); [ 191.731243][ T369] lock(&xfs_nondir_ilock_class); [ 191.754025][ T369] [ 191.754025][ T369] *** DEADLOCK *** [ 191.754025][ T369] [ 191.791126][ T369] 4 locks held by kswapd3/369: [ 191.812198][ T369] #0: ffffffffb50ced00 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x0/0x30 [ 191.854319][ T369] #1: ffffffffb5074c50 
(shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x219/0x4b0 [ 191.896043][ T369] #2: ffff8890279b40e0 (&type->s_umount_key#27){++++}-{3:3}, at: trylock_super+0x11/0xb0 [ 191.940538][ T369] #3: ffff889027a73a28 (&pag->pag_ici_reclaim_lock){+.+.}-{3:3}, at: xfs_reclaim_inodes_ag+0x135/0xb00 [ 191.995314][ T369] [ 191.995314][ T369] stack backtrace: [ 192.022934][ T369] CPU: 42 PID: 369 Comm: kswapd3 Not tainted 5.8.0-rc1-next-20200621 #1 [ 192.060546][ T369] Hardware name: HP ProLiant BL660c Gen9, BIOS I38 10/17/2018 [ 192.094518][ T369] Call Trace: [ 192.109005][ T369] dump_stack+0x9d/0xe0 [ 192.127468][ T369] check_noncircular+0x347/0x400 [ 192.149526][ T369] ? print_circular_bug+0x360/0x360 [ 192.172584][ T369] ? freezing_slow_path.cold.2+0x2a/0x2a [ 192.197251][ T369] __lock_acquire+0x2efc/0x4da0 [ 192.218737][ T369] ? lockdep_hardirqs_on_prepare+0x550/0x550 [ 192.246736][ T369] ? __lock_acquire+0x3541/0x4da0 [ 192.269673][ T369] lock_acquire+0x1ac/0xaf0 [ 192.290192][ T369] ? xfs_reclaim_inode+0xdf/0x860 [ 192.313158][ T369] ? rcu_read_unlock+0x50/0x50 [ 192.335057][ T369] down_write_nested+0x92/0x150 [ 192.358409][ T369] ? xfs_reclaim_inode+0xdf/0x860 [ 192.380890][ T369] ? rwsem_down_write_slowpath+0xf50/0xf50 [ 192.406891][ T369] ? find_held_lock+0x33/0x1c0 [ 192.427925][ T369] ? xfs_ilock+0x2ef/0x370 [ 192.447496][ T369] ? xfs_reclaim_inode+0xdf/0x860 [ 192.472315][ T369] xfs_reclaim_inode+0xdf/0x860 [ 192.496649][ T369] ? xfs_inode_clear_reclaim_tag+0xa0/0xa0 [ 192.524188][ T369] ? do_raw_spin_unlock+0x4f/0x250 [ 192.546852][ T369] xfs_reclaim_inodes_ag+0x505/0xb00 [ 192.570473][ T369] ? xfs_reclaim_inode+0x860/0x860 [ 192.592692][ T369] ? mark_held_locks+0xb0/0x110 [ 192.614287][ T369] ? lockdep_hardirqs_on_prepare+0x38c/0x550 [ 192.640800][ T369] ? _raw_spin_unlock_irqrestore+0x39/0x40 [ 192.666695][ T369] ? try_to_wake_up+0xcf/0xf40 [ 192.688265][ T369] ? migrate_swap_stop+0xc10/0xc10 [ 192.711966][ T369] ? do_raw_spin_unlock+0x4f/0x250 [ 192.735032][ T369] xfs_reclaim_inodes_nr+0x93/0xd0 xfs_reclaim_inodes_nr at fs/xfs/xfs_icache.c:1399 [ 192.757674][ T369] ? xfs_reclaim_inodes+0x90/0x90 [ 192.780028][ T369] ? list_lru_count_one+0x177/0x300 [ 192.803010][ T369] super_cache_scan+0x2fd/0x430 super_cache_scan at fs/super.c:115 [ 192.824491][ T369] do_shrink_slab+0x317/0x990 do_shrink_slab at mm/vmscan.c:514 [ 192.845160][ T369] shrink_slab+0x3a8/0x4b0 shrink_slab_memcg at mm/vmscan.c:584 (inlined by) shrink_slab at mm/vmscan.c:662 [ 192.864722][ T369] ? do_shrink_slab+0x990/0x990 [ 192.886137][ T369] ? rcu_is_watching+0x2c/0x80 [ 192.907289][ T369] ? mem_cgroup_protected+0x228/0x470 [ 192.931166][ T369] ? vmpressure+0x25/0x290 [ 192.950595][ T369] shrink_node+0x49c/0x17b0 [ 192.972332][ T369] balance_pgdat+0x59c/0xed0 kswapd_shrink_node at mm/vmscan.c:3521 (inlined by) balance_pgdat at mm/vmscan.c:3670 [ 192.994918][ T369] ? __node_reclaim+0x950/0x950 [ 193.018625][ T369] ? lockdep_hardirqs_on_prepare+0x38c/0x550 [ 193.046566][ T369] ? _raw_spin_unlock_irq+0x1f/0x30 [ 193.070214][ T369] ? _raw_spin_unlock_irq+0x1f/0x30 [ 193.093176][ T369] ? finish_task_switch+0x129/0x650 [ 193.116225][ T369] ? finish_task_switch+0xf2/0x650 [ 193.138809][ T369] ? rcu_read_lock_bh_held+0xc0/0xc0 [ 193.163323][ T369] kswapd+0x5a4/0xc40 [ 193.182690][ T369] ? __kthread_parkme+0x4d/0x1a0 [ 193.204660][ T369] ? balance_pgdat+0xed0/0xed0 [ 193.225776][ T369] ? _raw_spin_unlock_irqrestore+0x39/0x40 [ 193.252306][ T369] ? finish_wait+0x270/0x270 [ 193.272473][ T369] ? 
__kthread_parkme+0x4d/0x1a0 [ 193.294476][ T369] ? __kthread_parkme+0xcc/0x1a0 [ 193.316704][ T369] ? balance_pgdat+0xed0/0xed0 [ 193.337808][ T369] kthread+0x358/0x420 [ 193.355666][ T369] ? kthread_create_worker_on_cpu+0xc0/0xc0 [ 193.381884][ T369] ret_from_fork+0x22/0x30
This is part of a gpu lockdep annotation series simply because it really helps to catch issues where gpu subsystem locks and primitives can deadlock with themselves through allocations and mmu notifiers. But aside from that motivation it should be completely free-standing, and can land through -mm/-rdma/-hmm or any other tree really whenever.
-Daniel
mm/mmu_notifier.c | 7 ------- mm/page_alloc.c | 23 ++++++++++++++--------- 2 files changed, 14 insertions(+), 16 deletions(-)
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c index 06852b896fa6..5d578b9122f8 100644 --- a/mm/mmu_notifier.c +++ b/mm/mmu_notifier.c @@ -612,13 +612,6 @@ int __mmu_notifier_register(struct mmu_notifier *subscription, lockdep_assert_held_write(&mm->mmap_sem); BUG_ON(atomic_read(&mm->mm_users) <= 0);
-	if (IS_ENABLED(CONFIG_LOCKDEP)) {
-		fs_reclaim_acquire(GFP_KERNEL);
-		lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
-		lock_map_release(&__mmu_notifier_invalidate_range_start_map);
-		fs_reclaim_release(GFP_KERNEL);
-	}
-
if (!mm->notifier_subscriptions) { /* * kmalloc cannot be called under mm_take_all_locks(), but we
diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 13cc653122b7..f8a222db4a53 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -57,6 +57,7 @@ #include <trace/events/oom.h> #include <linux/prefetch.h> #include <linux/mm_inline.h> +#include <linux/mmu_notifier.h> #include <linux/migrate.h> #include <linux/hugetlb.h> #include <linux/sched/rt.h> @@ -4124,7 +4125,7 @@ should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_fla static struct lockdep_map __fs_reclaim_map = STATIC_LOCKDEP_MAP_INIT("fs_reclaim", &__fs_reclaim_map);
-static bool __need_fs_reclaim(gfp_t gfp_mask) +static bool __need_reclaim(gfp_t gfp_mask) { gfp_mask = current_gfp_context(gfp_mask);
@@ -4136,10 +4137,6 @@ static bool __need_fs_reclaim(gfp_t gfp_mask) if (current->flags & PF_MEMALLOC) return false;
-	/* We're only interested __GFP_FS allocations for now */
-	if (!(gfp_mask & __GFP_FS))
-		return false;
-
 	if (gfp_mask & __GFP_NOLOCKDEP)
 		return false;

@@ -4158,15 +4155,23 @@ void __fs_reclaim_release(void)

 void fs_reclaim_acquire(gfp_t gfp_mask)
 {
-	if (__need_fs_reclaim(gfp_mask))
-		__fs_reclaim_acquire();
+	if (__need_reclaim(gfp_mask)) {
+		if (!(gfp_mask & __GFP_FS))
+			__fs_reclaim_acquire();
+		lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
+		lock_map_release(&__mmu_notifier_invalidate_range_start_map);
+	}
 }
 EXPORT_SYMBOL_GPL(fs_reclaim_acquire);

 void fs_reclaim_release(gfp_t gfp_mask)
 {
-	if (__need_fs_reclaim(gfp_mask))
-		__fs_reclaim_release();
+	if (__need_reclaim(gfp_mask)) {
+		if (!(gfp_mask & __GFP_FS))
+			__fs_reclaim_release();
+	}
 }
 EXPORT_SYMBOL_GPL(fs_reclaim_release);
 #endif
2.26.2
-- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
Just some tiny edits: - fix link to struct dma_fence - give slightly more meaningful title - the polling here is about implicit fences, explicit fences (in sync_file or drm_syncobj) also have their own polling
Signed-off-by: Daniel Vetter daniel.vetter@intel.com --- drivers/dma-buf/dma-buf.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c index 01ce125f8e8d..e018ef80451e 100644 --- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -161,11 +161,11 @@ static loff_t dma_buf_llseek(struct file *file, loff_t offset, int whence) }
/** - * DOC: fence polling + * DOC: implicit fence polling * * To support cross-device and cross-driver synchronization of buffer access - * implicit fences (represented internally in the kernel with &struct fence) can - * be attached to a &dma_buf. The glue for that and a few related things are + * implicit fences (represented internally in the kernel with &struct dma_fence) + * can be attached to a &dma_buf. The glue for that and a few related things are * provided in the &dma_resv structure. * * Userspace can query the state of these implicitly tracked fences using poll()
On 6/4/20 10:12 AM, Daniel Vetter wrote:
Just some tiny edits:
- fix link to struct dma_fence
- give slightly more meaningful title - the polling here is about implicit fences, explicit fences (in sync_file or drm_syncobj) also have their own polling
Signed-off-by: Daniel Vetter daniel.vetter@intel.com
Reviewed-by: Thomas Hellstrom thomas.hellstrom@intel.com
Just some tiny edits: - fix link to struct dma_fence - give slightly more meaningful title - the polling here is about implicit fences, explicit fences (in sync_file or drm_syncobj) also have their own polling
v2: I misplaced the .rst include change corresponding to this patch.
Reviewed-by: Thomas Hellstrom thomas.hellstrom@intel.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com --- Documentation/driver-api/dma-buf.rst | 6 +++--- drivers/dma-buf/dma-buf.c | 6 +++--- 2 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst index 63dec76d1d8d..7fb7b661febd 100644 --- a/Documentation/driver-api/dma-buf.rst +++ b/Documentation/driver-api/dma-buf.rst @@ -100,11 +100,11 @@ CPU Access to DMA Buffer Objects .. kernel-doc:: drivers/dma-buf/dma-buf.c :doc: cpu access
-Fence Poll Support -~~~~~~~~~~~~~~~~~~ +Implicit Fence Poll Support +~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. kernel-doc:: drivers/dma-buf/dma-buf.c - :doc: fence polling + :doc: implicit fence polling
Kernel Functions and Structures Reference ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c index 01ce125f8e8d..e018ef80451e 100644 --- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -161,11 +161,11 @@ static loff_t dma_buf_llseek(struct file *file, loff_t offset, int whence) }
/** - * DOC: fence polling + * DOC: implicit fence polling * * To support cross-device and cross-driver synchronization of buffer access - * implicit fences (represented internally in the kernel with &struct fence) can - * be attached to a &dma_buf. The glue for that and a few related things are + * implicit fences (represented internally in the kernel with &struct dma_fence) + * can be attached to a &dma_buf. The glue for that and a few related things are * provided in the &dma_resv structure. * * Userspace can query the state of these implicitly tracked fences using poll()
On Fri, Jun 12, 2020 at 09:05:35AM +0200, Daniel Vetter wrote:
Just some tiny edits:
- fix link to struct dma_fence
- give slightly more meaningful title - the polling here is about implicit fences, explicit fences (in sync_file or drm_syncobj) also have their own polling
v2: I misplaced the .rst include change corresponding to this patch.
Reviewed-by: Thomas Hellstrom thomas.hellstrom@intel.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com
I went ahead and merged this one, shouldn't be the controversial part of the series :-) -Daniel
Documentation/driver-api/dma-buf.rst | 6 +++--- drivers/dma-buf/dma-buf.c | 6 +++--- 2 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst index 63dec76d1d8d..7fb7b661febd 100644 --- a/Documentation/driver-api/dma-buf.rst +++ b/Documentation/driver-api/dma-buf.rst @@ -100,11 +100,11 @@ CPU Access to DMA Buffer Objects .. kernel-doc:: drivers/dma-buf/dma-buf.c :doc: cpu access
-Fence Poll Support -~~~~~~~~~~~~~~~~~~ +Implicit Fence Poll Support +~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. kernel-doc:: drivers/dma-buf/dma-buf.c
-   :doc: fence polling
+   :doc: implicit fence polling
Kernel Functions and Structures Reference
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c index 01ce125f8e8d..e018ef80451e 100644 --- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -161,11 +161,11 @@ static loff_t dma_buf_llseek(struct file *file, loff_t offset, int whence) } /** - * DOC: fence polling + * DOC: implicit fence polling * * To support cross-device and cross-driver synchronization of buffer access - * implicit fences (represented internally in the kernel with &struct fence) can - * be attached to a &dma_buf. The glue for that and a few related things are + * implicit fences (represented internally in the kernel with &struct dma_fence) + * can be attached to a &dma_buf. The glue for that and a few related things are * provided in the &dma_resv structure. * * Userspace can query the state of these implicitly tracked fences using poll() -- 2.26.2
Design is similar to the lockdep annotations for workers, but with some twists:
- We use a read-lock for the execution/worker/completion side, so that this explicit annotation can be more liberally sprinkled around. With read locks lockdep isn't going to complain if the read-side isn't nested the same way under all circumstances, so ABBA deadlocks are ok. Which they are, since this is an annotation only.
- We're using non-recursive lockdep read lock mode, since in recursive read lock mode lockdep does not catch read side hazards. And we _very_ much want read side hazards to be caught. For full details of this limitation see
commit e91498589746065e3ae95d9a00b068e525eec34f Author: Peter Zijlstra peterz@infradead.org Date: Wed Aug 23 13:13:11 2017 +0200
locking/lockdep/selftests: Add mixed read-write ABBA tests
- To allow nesting of the read-side explicit annotations we explicitly keep track of the nesting. lock_is_held() allows us to do that.
- The wait-side annotation is a write lock, and entirely done within dma_fence_wait() for everyone by default.
- To be able to freely annotate helper functions I want to make it ok to call dma_fence_begin/end_signalling from soft/hardirq context. First attempt was using the hardirq locking context for the write side in lockdep, but this forces all normal spinlocks nested within dma_fence_begin/end_signalling to be irq-safe spinlocks. That's bollocks.
The approach now is to simply check in_atomic(), and for these cases entirely rely on the might_sleep() check in dma_fence_wait(). That will catch any wrong nesting against spinlocks from soft/hardirq contexts.
The idea here is that every code path that's critical for eventually signalling a dma_fence should be annotated with dma_fence_begin/end_signalling. The annotation ideally starts right after a dma_fence is published (added to a dma_resv, exposed as a sync_file fd, attached to a drm_syncobj fd, or anything else that makes the dma_fence visible to other kernel threads), up to and including the dma_fence_wait(). Examples are irq handlers, the scheduler rt threads, the tail of execbuf (after the corresponding fences are visible), any workers that end up signalling dma_fences and really anything else. Not annotated should be code paths that only complete fences opportunistically as the gpu progresses, like e.g. shrinker/eviction code.
The main class of deadlocks this is supposed to catch are:
Thread A:
mutex_lock(A); mutex_unlock(A);
dma_fence_signal();
Thread B:
mutex_lock(A); dma_fence_wait(); mutex_unlock(A);
Thread B is blocked on A signalling the fence, but A never gets around to that because it cannot acquire the lock A.
Note that dma_fence_wait() is allowed to be nested within dma_fence_begin/end_signalling sections. To allow this to happen the read lock needs to be upgraded to a write lock, which means that if any other lock is acquired between the dma_fence_begin_signalling() call and the call to dma_fence_wait(), and is still held, this will result in an immediate lockdep complaint. The only other option would be to not annotate such calls, defeating the point. Therefore these annotations cannot be sprinkled over the code entirely mindlessly if false positives are to be avoided.
v2: handle soft/hardirq ctx better against write side and dont forget EXPORT_SYMBOL, drivers can't use this otherwise.
v3: Kerneldoc.
v4: Some spelling fixes from Mika
Cc: Mika Kuoppala mika.kuoppala@intel.com Cc: Thomas Hellstrom thomas.hellstrom@intel.com Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Cc: linux-rdma@vger.kernel.org Cc: amd-gfx@lists.freedesktop.org Cc: intel-gfx@lists.freedesktop.org Cc: Chris Wilson chris@chris-wilson.co.uk Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com --- Documentation/driver-api/dma-buf.rst | 12 +- drivers/dma-buf/dma-fence.c | 161 +++++++++++++++++++++++++++ include/linux/dma-fence.h | 12 ++ 3 files changed, 182 insertions(+), 3 deletions(-)
diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst index 63dec76d1d8d..05d856131140 100644 --- a/Documentation/driver-api/dma-buf.rst +++ b/Documentation/driver-api/dma-buf.rst @@ -100,11 +100,11 @@ CPU Access to DMA Buffer Objects .. kernel-doc:: drivers/dma-buf/dma-buf.c :doc: cpu access
-Fence Poll Support -~~~~~~~~~~~~~~~~~~ +Implicit Fence Poll Support +~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. kernel-doc:: drivers/dma-buf/dma-buf.c - :doc: fence polling + :doc: implicit fence polling
Kernel Functions and Structures Reference ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -133,6 +133,12 @@ DMA Fences .. kernel-doc:: drivers/dma-buf/dma-fence.c :doc: DMA fences overview
+DMA Fence Signalling Annotations +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. kernel-doc:: drivers/dma-buf/dma-fence.c + :doc: fence signalling annotation + DMA Fences Functions Reference ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c index 656e9ac2d028..0005bc002529 100644 --- a/drivers/dma-buf/dma-fence.c +++ b/drivers/dma-buf/dma-fence.c @@ -110,6 +110,160 @@ u64 dma_fence_context_alloc(unsigned num) } EXPORT_SYMBOL(dma_fence_context_alloc);
+/** + * DOC: fence signalling annotation + * + * Proving correctness of all the kernel code around &dma_fence through code + * review and testing is tricky for a few reasons: + * + * * It is a cross-driver contract, and therefore all drivers must follow the + * same rules for lock nesting order, calling contexts for various functions + * and anything else significant for in-kernel interfaces. But it is also + * impossible to test all drivers in a single machine, hence brute-force N vs. + * N testing of all combinations is impossible. Even just limiting to the + * possible combinations is infeasible. + * + * * There is an enormous amount of driver code involved. For render drivers + * there's the tail of command submission, after fences are published, + * scheduler code, interrupt and workers to process job completion, + * and timeout, gpu reset and gpu hang recovery code. Plus for integration + * with core mm with have &mmu_notifier, respectively &mmu_interval_notifier, + * and &shrinker. For modesetting drivers there's the commit tail functions + * between when fences for an atomic modeset are published, and when the + * corresponding vblank completes, including any interrupt processing and + * related workers. Auditing all that code, across all drivers, is not + * feasible. + * + * * Due to how many other subsystems are involved and the locking hierarchies + * this pulls in there is extremely thin wiggle-room for driver-specific + * differences. &dma_fence interacts with almost all of the core memory + * handling through page fault handlers via &dma_resv, dma_resv_lock() and + * dma_resv_unlock(). On the other side it also interacts through all + * allocation sites through &mmu_notifier and &shrinker. + * + * Furthermore lockdep does not handle cross-release dependencies, which means + * any deadlocks between dma_fence_wait() and dma_fence_signal() can't be caught + * at runtime with some quick testing. The simplest example is one thread + * waiting on a &dma_fence while holding a lock:: + * + * lock(A); + * dma_fence_wait(B); + * unlock(A); + * + * while the other thread is stuck trying to acquire the same lock, which + * prevents it from signalling the fence the previous thread is stuck waiting + * on:: + * + * lock(A); + * unlock(A); + * dma_fence_signal(B); + * + * By manually annotating all code relevant to signalling a &dma_fence we can + * teach lockdep about these dependencies, which also helps with the validation + * headache since now lockdep can check all the rules for us:: + * + * cookie = dma_fence_begin_signalling(); + * lock(A); + * unlock(A); + * dma_fence_signal(B); + * dma_fence_end_signalling(cookie); + * + * For using dma_fence_begin_signalling() and dma_fence_end_signalling() to + * annotate critical sections the following rules need to be observed: + * + * * All code necessary to complete a &dma_fence must be annotated, from the + * point where a fence is accessible to other threads, to the point where + * dma_fence_signal() is called. Un-annotated code can contain deadlock issues, + * and due to the very strict rules and many corner cases it is infeasible to + * catch these just with review or normal stress testing. + * + * * &struct dma_resv deserves a special note, since the readers are only + * protected by rcu. This means the signalling critical section starts as soon + * as the new fences are installed, even before dma_resv_unlock() is called. 
+ * + * * The only exception are fast paths and opportunistic signalling code, which + * calls dma_fence_signal() purely as an optimization, but is not required to + * guarantee completion of a &dma_fence. The usual example is a wait IOCTL + * which calls dma_fence_signal(), while the mandatory completion path goes + * through a hardware interrupt and possible job completion worker. + * + * * To aid composability of code, the annotations can be freely nested, as long + * as the overall locking hierarchy is consistent. The annotations also work + * both in interrupt and process context. Due to implementation details this + * requires that callers pass an opaque cookie from + * dma_fence_begin_signalling() to dma_fence_end_signalling(). + * + * * Validation against the cross driver contract is implemented by priming + * lockdep with the relevant hierarchy at boot-up. This means even just + * testing with a single device is enough to validate a driver, at least as + * far as deadlocks with dma_fence_wait() against dma_fence_signal() are + * concerned. + */ +#ifdef CONFIG_LOCKDEP +struct lockdep_map dma_fence_lockdep_map = { + .name = "dma_fence_map" +}; + +/** + * dma_fence_begin_signalling - begin a critical DMA fence signalling section + * + * Drivers should use this to annotate the beginning of any code section + * required to eventually complete &dma_fence by calling dma_fence_signal(). + * + * The end of these critical sections are annotated with + * dma_fence_end_signalling(). + * + * Returns: + * + * Opaque cookie needed by the implementation, which needs to be passed to + * dma_fence_end_signalling(). + */ +bool dma_fence_begin_signalling(void) +{ + /* explicitly nesting ... */ + if (lock_is_held_type(&dma_fence_lockdep_map, 1)) + return true; + + /* rely on might_sleep check for soft/hardirq locks */ + if (in_atomic()) + return true; + + /* ... and non-recursive readlock */ + lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _RET_IP_); + + return false; +} +EXPORT_SYMBOL(dma_fence_begin_signalling); + +/** + * dma_fence_end_signalling - end a critical DMA fence signalling section + * + * Closes a critical section annotation opened by dma_fence_begin_signalling(). + */ +void dma_fence_end_signalling(bool cookie) +{ + if (cookie) + return; + + lock_release(&dma_fence_lockdep_map, _RET_IP_); +} +EXPORT_SYMBOL(dma_fence_end_signalling); + +void __dma_fence_might_wait(void) +{ + bool tmp; + + tmp = lock_is_held_type(&dma_fence_lockdep_map, 1); + if (tmp) + lock_release(&dma_fence_lockdep_map, _THIS_IP_); + lock_map_acquire(&dma_fence_lockdep_map); + lock_map_release(&dma_fence_lockdep_map); + if (tmp) + lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _THIS_IP_); +} +#endif + + /** * dma_fence_signal_locked - signal completion of a fence * @fence: the fence to signal @@ -170,14 +324,19 @@ int dma_fence_signal(struct dma_fence *fence) { unsigned long flags; int ret; + bool tmp;
if (!fence) return -EINVAL;
+ tmp = dma_fence_begin_signalling(); + spin_lock_irqsave(fence->lock, flags); ret = dma_fence_signal_locked(fence); spin_unlock_irqrestore(fence->lock, flags);
+ dma_fence_end_signalling(tmp); + return ret; } EXPORT_SYMBOL(dma_fence_signal); @@ -210,6 +369,8 @@ dma_fence_wait_timeout(struct dma_fence *fence, bool intr, signed long timeout)
might_sleep();
+ __dma_fence_might_wait(); + trace_dma_fence_wait_start(fence); if (fence->ops->wait) ret = fence->ops->wait(fence, intr, timeout); diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h index 3347c54f3a87..3f288f7db2ef 100644 --- a/include/linux/dma-fence.h +++ b/include/linux/dma-fence.h @@ -357,6 +357,18 @@ dma_fence_get_rcu_safe(struct dma_fence __rcu **fencep) } while (1); }
+#ifdef CONFIG_LOCKDEP +bool dma_fence_begin_signalling(void); +void dma_fence_end_signalling(bool cookie); +#else +static inline bool dma_fence_begin_signalling(void) +{ + return true; +} +static inline void dma_fence_end_signalling(bool cookie) {} +static inline void __dma_fence_might_wait(void) {} +#endif + int dma_fence_signal(struct dma_fence *fence); int dma_fence_signal_locked(struct dma_fence *fence); signed long dma_fence_default_wait(struct dma_fence *fence,
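To illustrate intended usage, here is a minimal sketch of a job-completion worker wrapped in the new annotations; struct my_job and all of its fields are made up for the example and not part of the patch:

#include <linux/dma-fence.h>
#include <linux/workqueue.h>
#include <linux/spinlock.h>
#include <linux/list.h>

struct my_job {
	struct work_struct done_work;
	struct list_head link;
	spinlock_t *active_lock;	/* points at the device's job list lock */
	struct dma_fence *fence;	/* fence published at submission time */
};

static void my_job_done_work(struct work_struct *work)
{
	struct my_job *job = container_of(work, struct my_job, done_work);
	bool cookie;

	/* everything from here on is required for job->fence to signal */
	cookie = dma_fence_begin_signalling();

	spin_lock(job->active_lock);	/* ordinary spinlocks are fine here */
	list_del(&job->link);
	spin_unlock(job->active_lock);

	dma_fence_signal(job->fence);
	dma_fence_end_signalling(cookie);
}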
On 6/4/20 10:12 AM, Daniel Vetter wrote: ...
Thread A:
mutex_lock(A); mutex_unlock(A);
dma_fence_signal();
Thread B:
mutex_lock(A); dma_fence_wait(); mutex_unlock(A);
Thread B is blocked on A signalling the fence, but A never gets around to that because it cannot acquire the lock A.
Note that dma_fence_wait() is allowed to be nested within dma_fence_begin/end_signalling sections. To allow this to happen the read lock needs to be upgraded to a write lock, which means that any other lock is acquired between the dma_fence_begin_signalling() call and the call to dma_fence_wait(), and still held, this will result in an immediate lockdep complaint. The only other option would be to not annotate such calls, defeating the point. Therefore these annotations cannot be sprinkled over the code entirely mindless to avoid false positives.
Just realized, isn't that example actually a true positive, or at least a great candidate for a true positive, since if another thread reenters that signaling path, it will block on that mutex, and the fence would never be signaled unless there is another signaling path?
Although I agree the conclusion is sound: These annotations cannot be sprinkled mindlessly over the code.
/Thomas
On Thu, Jun 4, 2020 at 10:57 AM Thomas Hellström (Intel) thomas_os@shipmail.org wrote:
On 6/4/20 10:12 AM, Daniel Vetter wrote: ...
Thread A:
mutex_lock(A); mutex_unlock(A); dma_fence_signal();
Thread B:
mutex_lock(A); dma_fence_wait(); mutex_unlock(A);
Thread B is blocked on A signalling the fence, but A never gets around to that because it cannot acquire the lock A.
Note that dma_fence_wait() is allowed to be nested within dma_fence_begin/end_signalling sections. To allow this to happen the read lock needs to be upgraded to a write lock, which means that any other lock is acquired between the dma_fence_begin_signalling() call and the call to dma_fence_wait(), and still held, this will result in an immediate lockdep complaint. The only other option would be to not annotate such calls, defeating the point. Therefore these annotations cannot be sprinkled over the code entirely mindless to avoid false positives.
Just realized, isn't that example actually a true positive, or at least a great candidate for a true positive, since if another thread reenters that signaling path, it will block on that mutex, and the fence would never be signaled unless there is another signaling path?
Not sure I understand fully, but I think the answer is "it's complicated".
dma_fence are meant to be a DAG (directed acyclic graph). Now it would be nice to enforce that, and i915 has some attempts to that effect, but these annotations here don't try to pull off that miracle. I'm assuming that all the dependencies between dma_fence don't create a loop, and instead I'm only focusing on deadlocks between dma_fences and other locks. Usually an async work looks like this:
1. wait for a bunch of dma_fence that we have as dependencies
2. do work (e.g. atomic commit)
3. signal the dma_fence that represents our work
This can happen on the cpu in a kthread or worker, or on the gpu. Now for various reasons you might want to have a per-work mutex or something and hold that while going through all this, and this is the false positive I'm thinking of. Of course, if your fences aren't a DAG, or if you're holding a mutex that's shared with some other work which is part of your dependency chain, then this goes boom. But it doesn't have to.
I think in general it's best to purely rely on ordering, and remove as much locking as possible. This is the design behind the atomic modeset commit code, which does not take any mutexes in the commit path, at least not in the helpers. Drivers can still do stuff of course. Then the only locks you're left with are spinlocks (maybe irq safe ones) to coordinate with interrupt handlers, workers, handle the wait/wake queues, manage work/scheduler run queues and all that stuff, and no mutexes.
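Roughly, as a sketch of that 1./2./3. pattern, relying purely on ordering and with no per-work mutex (struct my_async_work and all of its fields are made-up names, not from the mail):

#include <linux/dma-fence.h>
#include <linux/workqueue.h>

struct my_async_work {
	struct work_struct base;
	struct dma_fence **deps;	/* dependencies, assumed to form a DAG */
	unsigned int num_deps;
	struct dma_fence *out_fence;	/* already published to others */
	void (*run)(struct my_async_work *work);
};

static void my_async_work_fn(struct work_struct *w)
{
	struct my_async_work *work = container_of(w, struct my_async_work, base);
	bool cookie = dma_fence_begin_signalling();
	unsigned int i;

	/* 1. wait for the dma_fence dependencies (allowed inside the section) */
	for (i = 0; i < work->num_deps; i++)
		dma_fence_wait(work->deps[i], false);

	/* 2. do the work, e.g. an atomic commit tail */
	work->run(work);

	/* 3. signal the fence representing this work */
	dma_fence_signal(work->out_fence);
	dma_fence_end_signalling(cookie);
}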
Now for the case where you have something like the below:
thread 1:
dma_fence_begin_signalling()
mutex_lock(a);
dma_fence_wait(b1);
mutex_unlock(a);

dma_fence_signal(b2);
dma_fence_end_signalling();
That's indeed a bit problematic, assuming you're annotating stuff correctly, and the locking is actually required. I've seen a few of these, and annotating them properly needs care:
- often the mutex_lock/unlock is not needed, and just gets in the way. This was the case for the original atomic modeset commit work patches, which again locked all the modeset locks. But strict ordering of commit work was all that was needed to make this work, plus making sure data structure lifetimes are handled correctly too. I think the tendency to abuse locking to handle lifetime and ordering problems is fairly common, but it can lead to lots of trouble. Ime all async work items with the above problematic pattern can be fixed like this.
- the other common case is that the dma_fence_begin_signalling() can & should be pushed down past the mutex_lock, and maybe even past the dma_fence_wait, depending upon when/how the dma_fence is published. The fence signalling critical section can still extend past the mutex_unlock, lockdep and semantics are fine with that (I think at least). This is more the case for execbuf tails, where you take locks, set up some async work, publish the fences and then begin to process these fences (which could just be pushing the work to the job scheduler, but could also involve running it directly in the userspace process thread context, but with locks already dropped).
So I wouldn't go out and say these are true positives, just maybe unnecessary locking and over-eager annotations, without any real bugs in the code.
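As a sketch of that second case, with the annotation pushed down past the lock and the dependency wait (my_obj, publish_fence() and the rest are made-up names, not from any driver):

#include <linux/dma-fence.h>
#include <linux/mutex.h>
#include <linux/workqueue.h>

struct my_obj {
	struct mutex lock;
	struct work_struct signal_work;	/* eventually signals new_fence */
};

/* made-up helper: makes new_fence visible, e.g. adds it to a dma_resv */
static void publish_fence(struct my_obj *obj, struct dma_fence *new_fence);

static void my_execbuf_tail(struct my_obj *obj, struct dma_fence *old_fence,
			    struct dma_fence *new_fence)
{
	bool cookie;

	mutex_lock(&obj->lock);
	dma_fence_wait(old_fence, false);	/* before the section starts */

	/* the signalling critical section only begins once new_fence
	 * becomes visible to other threads */
	cookie = dma_fence_begin_signalling();
	publish_fence(obj, new_fence);
	mutex_unlock(&obj->lock);

	/* the section extending past mutex_unlock() is fine */
	queue_work(system_wq, &obj->signal_work);
	dma_fence_end_signalling(cookie);
}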
Or am I completely off the track and you're thinking of something else?
Although I agree the conclusion is sound: These annotations cannot be sprinkled mindlessly over the code.
Yup, that much is for sure. -Daniel
Quoting Daniel Vetter (2020-06-04 10:21:46)
On Thu, Jun 4, 2020 at 10:57 AM Thomas Hellström (Intel) thomas_os@shipmail.org wrote:
On 6/4/20 10:12 AM, Daniel Vetter wrote: ...
Thread A:
mutex_lock(A); mutex_unlock(A); dma_fence_signal();
Thread B:
mutex_lock(A); dma_fence_wait(); mutex_unlock(A);
Thread B is blocked on A signalling the fence, but A never gets around to that because it cannot acquire the lock A.
Note that dma_fence_wait() is allowed to be nested within dma_fence_begin/end_signalling sections. To allow this to happen the read lock needs to be upgraded to a write lock, which means that any other lock is acquired between the dma_fence_begin_signalling() call and the call to dma_fence_wait(), and still held, this will result in an immediate lockdep complaint. The only other option would be to not annotate such calls, defeating the point. Therefore these annotations cannot be sprinkled over the code entirely mindless to avoid false positives.
Just realized, isn't that example actually a true positive, or at least a great candidate for a true positive, since if another thread reenters that signaling path, it will block on that mutex, and the fence would never be signaled unless there is another signaling path?
Not sure I understand fully, but I think the answer is "it's complicated".
See cd8084f91c02 ("locking/lockdep: Apply crossrelease to completions")
dma_fence usage here is nothing but another name for a completion. -Chris
On Thu, Jun 4, 2020 at 11:27 AM Chris Wilson chris@chris-wilson.co.uk wrote:
Quoting Daniel Vetter (2020-06-04 10:21:46)
On Thu, Jun 4, 2020 at 10:57 AM Thomas Hellström (Intel) thomas_os@shipmail.org wrote:
On 6/4/20 10:12 AM, Daniel Vetter wrote: ...
Thread A:
mutex_lock(A); mutex_unlock(A); dma_fence_signal();
Thread B:
mutex_lock(A); dma_fence_wait(); mutex_unlock(A);
Thread B is blocked on A signalling the fence, but A never gets around to that because it cannot acquire the lock A.
Note that dma_fence_wait() is allowed to be nested within dma_fence_begin/end_signalling sections. To allow this to happen the read lock needs to be upgraded to a write lock, which means that any other lock is acquired between the dma_fence_begin_signalling() call and the call to dma_fence_wait(), and still held, this will result in an immediate lockdep complaint. The only other option would be to not annotate such calls, defeating the point. Therefore these annotations cannot be sprinkled over the code entirely mindless to avoid false positives.
Just realized, isn't that example actually a true positive, or at least a great candidate for a true positive, since if another thread reenters that signaling path, it will block on that mutex, and the fence would never be signaled unless there is another signaling path?
Not sure I understand fully, but I think the answer is "it's complicated".
See cd8084f91c02 ("locking/lockdep: Apply crossrelease to completions")
dma_fence usage here is nothing but another name for a completion.
Quoting from my previous cover letter:
"I've dragged my feet for years on this, hoping that cross-release lockdep would do this for us, but well that never really happened unfortunately. So here we are."
I discussed this with Peter, cross-release not getting in is pretty final it seems. The trouble is false positives without explicit begin/end annotations reviewed by humans - ime from just these few examples you just can't guess this stuff with computers, you need real brains thinking about all the edge cases, and where exactly the critical section starts and ends. Without that you're just going to drown in a sea of false positives and yuck.
So yeah I had hopes for cross-release too, unfortunately that was entirely in vain and a distraction.
Now I guess it would be nice if there were a per-class completion_begin/end annotation for the more generic problem. But then most people also don't have a cross-driver completion api contract like dma_fence has, with some of the most ridiculous over-the-top constraints on what's possible and what's not possible on each side of the cross-release. We do have a bit of an outsized benefit (in pain reduction) vs cost ratio here. -Daniel
Design is similar to the lockdep annotations for workers, but with some twists:
- We use a read-lock for the execution/worker/completion side, so that this explicit annotation can be more liberally sprinkled around. With read locks lockdep isn't going to complain if the read-side isn't nested the same way under all circumstances, so ABBA deadlocks are ok. Which they are, since this is an annotation only.
- We're using non-recursive lockdep read lock mode, since in recursive read lock mode lockdep does not catch read side hazards. And we _very_ much want read side hazards to be caught. For full details of this limitation see
commit e91498589746065e3ae95d9a00b068e525eec34f Author: Peter Zijlstra peterz@infradead.org Date: Wed Aug 23 13:13:11 2017 +0200
locking/lockdep/selftests: Add mixed read-write ABBA tests
- To allow nesting of the read-side explicit annotations we explicitly keep track of the nesting. lock_is_held() allows us to do that.
- The wait-side annotation is a write lock, and entirely done within dma_fence_wait() for everyone by default.
- To be able to freely annotate helper functions I want to make it ok to call dma_fence_begin/end_signalling from soft/hardirq context. First attempt was using the hardirq locking context for the write side in lockdep, but this forces all normal spinlocks nested within dma_fence_begin/end_signalling to be irq-safe spinlocks. That's bollocks.
The approach now is to simply check in_atomic(), and for these cases entirely rely on the might_sleep() check in dma_fence_wait(). That will catch any wrong nesting against spinlocks from soft/hardirq contexts.
The idea here is that every code path that's critical for eventually signalling a dma_fence should be annotated with dma_fence_begin/end_signalling. The annotation ideally starts right after a dma_fence is published (added to a dma_resv, exposed as a sync_file fd, attached to a drm_syncobj fd, or anything else that makes the dma_fence visible to other kernel threads), up to and including the dma_fence_wait(). Examples are irq handlers, the scheduler rt threads, the tail of execbuf (after the corresponding fences are visible), any workers that end up signalling dma_fences and really anything else. Not annotated should be code paths that only complete fences opportunistically as the gpu progresses, like e.g. shrinker/eviction code.
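For illustration, a driver-side job-completion worker annotated along these lines might look roughly like the sketch below. This is not part of the patch; struct foo_dev, foo_fence_work() and foo_process_completed_jobs() are invented names, only the dma_fence_begin/end_signalling() usage is what the patch actually provides.

#include <linux/dma-fence.h>
#include <linux/workqueue.h>

/* hypothetical driver structure, for illustration only */
struct foo_dev {
	struct work_struct fence_work;
	/* ... job lists, locks, etc. ... */
};

/* made-up helper: walks completed jobs and ends in dma_fence_signal() */
static void foo_process_completed_jobs(struct foo_dev *fdev);

static void foo_fence_work(struct work_struct *work)
{
	struct foo_dev *fdev = container_of(work, struct foo_dev, fence_work);
	bool cookie;

	/*
	 * Everything from here until dma_fence_end_signalling() is needed
	 * to eventually signal the fences of completed jobs, so it is part
	 * of the signalling critical section.
	 */
	cookie = dma_fence_begin_signalling();

	foo_process_completed_jobs(fdev);

	dma_fence_end_signalling(cookie);
}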
The main class of deadlocks this is supposed to catch are:
Thread A:
mutex_lock(A); mutex_unlock(A);
dma_fence_signal();
Thread B:
mutex_lock(A); dma_fence_wait(); mutex_unlock(A);
Thread B is blocked on A signalling the fence, but A never gets around to that because it cannot acquire the lock A.
Note that dma_fence_wait() is allowed to be nested within dma_fence_begin/end_signalling sections. To allow this to happen the read lock needs to be upgraded to a write lock, which means that if any other lock is acquired between the dma_fence_begin_signalling() call and the call to dma_fence_wait(), and is still held at that point, this will result in an immediate lockdep complaint. The only other option would be to not annotate such calls, defeating the point. Therefore these annotations cannot be sprinkled over the code entirely mindlessly if false positives are to be avoided.
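To make that rule concrete, a minimal sketch of the pattern that trips lockdep immediately (obj, in_fence and out_fence are hypothetical, this is not code from the patch):

	bool cookie = dma_fence_begin_signalling();

	mutex_lock(&obj->lock);
	/* nested wait is allowed by itself, but obj->lock was taken inside
	 * the section and is still held here, so lockdep complains at once */
	dma_fence_wait(in_fence, false);
	mutex_unlock(&obj->lock);

	dma_fence_signal(out_fence);
	dma_fence_end_signalling(cookie);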
Originally I hoped that the cross-release lockdep extensions would alleviate the need for explicit annotations:
https://lwn.net/Articles/709849/
But there are a few reasons why that's not an option:
- It's not happening in upstream, since it got reverted due to too many false positives:
commit e966eaeeb623f09975ef362c2866fae6f86844f9 Author: Ingo Molnar mingo@kernel.org Date: Tue Dec 12 12:31:16 2017 +0100
locking/lockdep: Remove the cross-release locking checks
This code (CONFIG_LOCKDEP_CROSSRELEASE=y and CONFIG_LOCKDEP_COMPLETIONS=y), while it found a number of old bugs initially, was also causing too many false positives that caused people to disable lockdep - which is arguably a worse overall outcome.
- cross-release uses the complete() call to annotate the end of critical sections, for dma_fence that would be dma_fence_signal(). But we do not want all dma_fence_signal() calls to be treated as critical, since many are opportunistic cleanup of gpu requests. If these get stuck there's still the main completion interrupt and workers who can unblock everyone. Automatically annotating all dma_fence_signal() calls would hence cause false positives.
- cross-release had some educated guesses for when a critical section starts, like fresh syscall or fresh work callback. This would again cause false positives without explicit annotations, since for dma_fence the critical sections only start when we publish a fence.
- Furthermore there can be cases where a thread never calls dma_fence_signal(), but is still critical for reaching completion of fences. One example would be a scheduler kthread which picks up jobs and pushes them into hardware, where the interrupt handler or another completion thread calls dma_fence_signal(). But if the scheduler thread hangs, then all the fences hang, hence we need to manually annotate it (see the sketch after this list). cross-release aimed to solve this by chaining cross-release dependencies, but the dependency from the scheduler thread to the completion interrupt handler goes through hw where cross-release code can't observe it.
In short, without manual annotations and careful review of the start and end of critical sections, cross-release dependency tracking doesn't work. We need explicit annotations.
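As a rough illustration of that last point (again not from the patch, and not from any real driver - struct foo_sched, struct foo_job and the helpers are invented): a scheduler-style kthread that never signals fences itself would still be annotated, because nothing completes if it stalls:

static int foo_sched_main(void *arg)
{
	struct foo_sched *sched = arg;

	while (!kthread_should_stop()) {
		struct foo_job *job;
		bool cookie;

		/* made-up helper: sleeps until a submitted job is available */
		job = foo_sched_wait_for_job(sched);
		if (!job)
			continue;

		cookie = dma_fence_begin_signalling();
		/* push to hw; the completion irq handler calls dma_fence_signal() */
		foo_hw_submit(job);
		dma_fence_end_signalling(cookie);
	}

	return 0;
}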
v2: handle soft/hardirq ctx better against write side and don't forget EXPORT_SYMBOL, drivers can't use this otherwise.
v3: Kerneldoc.
v4: Some spelling fixes from Mika
v5: Amend commit message to explain in detail why cross-release isn't the solution.
Cc: Mika Kuoppala mika.kuoppala@intel.com Cc: Thomas Hellstrom thomas.hellstrom@intel.com Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Cc: linux-rdma@vger.kernel.org Cc: amd-gfx@lists.freedesktop.org Cc: intel-gfx@lists.freedesktop.org Cc: Chris Wilson chris@chris-wilson.co.uk Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com --- Documentation/driver-api/dma-buf.rst | 12 +- drivers/dma-buf/dma-fence.c | 161 +++++++++++++++++++++++++++ include/linux/dma-fence.h | 12 ++ 3 files changed, 182 insertions(+), 3 deletions(-)
diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst index 63dec76d1d8d..05d856131140 100644 --- a/Documentation/driver-api/dma-buf.rst +++ b/Documentation/driver-api/dma-buf.rst @@ -100,11 +100,11 @@ CPU Access to DMA Buffer Objects .. kernel-doc:: drivers/dma-buf/dma-buf.c :doc: cpu access
-Fence Poll Support -~~~~~~~~~~~~~~~~~~ +Implicit Fence Poll Support +~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. kernel-doc:: drivers/dma-buf/dma-buf.c - :doc: fence polling + :doc: implicit fence polling
Kernel Functions and Structures Reference ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -133,6 +133,12 @@ DMA Fences .. kernel-doc:: drivers/dma-buf/dma-fence.c :doc: DMA fences overview
+DMA Fence Signalling Annotations +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. kernel-doc:: drivers/dma-buf/dma-fence.c + :doc: fence signalling annotation + DMA Fences Functions Reference ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c index 656e9ac2d028..0005bc002529 100644 --- a/drivers/dma-buf/dma-fence.c +++ b/drivers/dma-buf/dma-fence.c @@ -110,6 +110,160 @@ u64 dma_fence_context_alloc(unsigned num) } EXPORT_SYMBOL(dma_fence_context_alloc);
+/** + * DOC: fence signalling annotation + * + * Proving correctness of all the kernel code around &dma_fence through code + * review and testing is tricky for a few reasons: + * + * * It is a cross-driver contract, and therefore all drivers must follow the + * same rules for lock nesting order, calling contexts for various functions + * and anything else significant for in-kernel interfaces. But it is also + * impossible to test all drivers in a single machine, hence brute-force N vs. + * N testing of all combinations is impossible. Even just limiting to the + * possible combinations is infeasible. + * + * * There is an enormous amount of driver code involved. For render drivers + * there's the tail of command submission, after fences are published, + * scheduler code, interrupt and workers to process job completion, + * and timeout, gpu reset and gpu hang recovery code. Plus for integration + * with core mm with have &mmu_notifier, respectively &mmu_interval_notifier, + * and &shrinker. For modesetting drivers there's the commit tail functions + * between when fences for an atomic modeset are published, and when the + * corresponding vblank completes, including any interrupt processing and + * related workers. Auditing all that code, across all drivers, is not + * feasible. + * + * * Due to how many other subsystems are involved and the locking hierarchies + * this pulls in there is extremely thin wiggle-room for driver-specific + * differences. &dma_fence interacts with almost all of the core memory + * handling through page fault handlers via &dma_resv, dma_resv_lock() and + * dma_resv_unlock(). On the other side it also interacts through all + * allocation sites through &mmu_notifier and &shrinker. + * + * Furthermore lockdep does not handle cross-release dependencies, which means + * any deadlocks between dma_fence_wait() and dma_fence_signal() can't be caught + * at runtime with some quick testing. The simplest example is one thread + * waiting on a &dma_fence while holding a lock:: + * + * lock(A); + * dma_fence_wait(B); + * unlock(A); + * + * while the other thread is stuck trying to acquire the same lock, which + * prevents it from signalling the fence the previous thread is stuck waiting + * on:: + * + * lock(A); + * unlock(A); + * dma_fence_signal(B); + * + * By manually annotating all code relevant to signalling a &dma_fence we can + * teach lockdep about these dependencies, which also helps with the validation + * headache since now lockdep can check all the rules for us:: + * + * cookie = dma_fence_begin_signalling(); + * lock(A); + * unlock(A); + * dma_fence_signal(B); + * dma_fence_end_signalling(cookie); + * + * For using dma_fence_begin_signalling() and dma_fence_end_signalling() to + * annotate critical sections the following rules need to be observed: + * + * * All code necessary to complete a &dma_fence must be annotated, from the + * point where a fence is accessible to other threads, to the point where + * dma_fence_signal() is called. Un-annotated code can contain deadlock issues, + * and due to the very strict rules and many corner cases it is infeasible to + * catch these just with review or normal stress testing. + * + * * &struct dma_resv deserves a special note, since the readers are only + * protected by rcu. This means the signalling critical section starts as soon + * as the new fences are installed, even before dma_resv_unlock() is called. 
+ * + * * The only exception are fast paths and opportunistic signalling code, which + * calls dma_fence_signal() purely as an optimization, but is not required to + * guarantee completion of a &dma_fence. The usual example is a wait IOCTL + * which calls dma_fence_signal(), while the mandatory completion path goes + * through a hardware interrupt and possible job completion worker. + * + * * To aid composability of code, the annotations can be freely nested, as long + * as the overall locking hierarchy is consistent. The annotations also work + * both in interrupt and process context. Due to implementation details this + * requires that callers pass an opaque cookie from + * dma_fence_begin_signalling() to dma_fence_end_signalling(). + * + * * Validation against the cross driver contract is implemented by priming + * lockdep with the relevant hierarchy at boot-up. This means even just + * testing with a single device is enough to validate a driver, at least as + * far as deadlocks with dma_fence_wait() against dma_fence_signal() are + * concerned. + */ +#ifdef CONFIG_LOCKDEP +struct lockdep_map dma_fence_lockdep_map = { + .name = "dma_fence_map" +}; + +/** + * dma_fence_begin_signalling - begin a critical DMA fence signalling section + * + * Drivers should use this to annotate the beginning of any code section + * required to eventually complete &dma_fence by calling dma_fence_signal(). + * + * The end of these critical sections are annotated with + * dma_fence_end_signalling(). + * + * Returns: + * + * Opaque cookie needed by the implementation, which needs to be passed to + * dma_fence_end_signalling(). + */ +bool dma_fence_begin_signalling(void) +{ + /* explicitly nesting ... */ + if (lock_is_held_type(&dma_fence_lockdep_map, 1)) + return true; + + /* rely on might_sleep check for soft/hardirq locks */ + if (in_atomic()) + return true; + + /* ... and non-recursive readlock */ + lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _RET_IP_); + + return false; +} +EXPORT_SYMBOL(dma_fence_begin_signalling); + +/** + * dma_fence_end_signalling - end a critical DMA fence signalling section + * + * Closes a critical section annotation opened by dma_fence_begin_signalling(). + */ +void dma_fence_end_signalling(bool cookie) +{ + if (cookie) + return; + + lock_release(&dma_fence_lockdep_map, _RET_IP_); +} +EXPORT_SYMBOL(dma_fence_end_signalling); + +void __dma_fence_might_wait(void) +{ + bool tmp; + + tmp = lock_is_held_type(&dma_fence_lockdep_map, 1); + if (tmp) + lock_release(&dma_fence_lockdep_map, _THIS_IP_); + lock_map_acquire(&dma_fence_lockdep_map); + lock_map_release(&dma_fence_lockdep_map); + if (tmp) + lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _THIS_IP_); +} +#endif + + /** * dma_fence_signal_locked - signal completion of a fence * @fence: the fence to signal @@ -170,14 +324,19 @@ int dma_fence_signal(struct dma_fence *fence) { unsigned long flags; int ret; + bool tmp;
if (!fence) return -EINVAL;
+ tmp = dma_fence_begin_signalling(); + spin_lock_irqsave(fence->lock, flags); ret = dma_fence_signal_locked(fence); spin_unlock_irqrestore(fence->lock, flags);
+ dma_fence_end_signalling(tmp); + return ret; } EXPORT_SYMBOL(dma_fence_signal); @@ -210,6 +369,8 @@ dma_fence_wait_timeout(struct dma_fence *fence, bool intr, signed long timeout)
might_sleep();
+ __dma_fence_might_wait(); + trace_dma_fence_wait_start(fence); if (fence->ops->wait) ret = fence->ops->wait(fence, intr, timeout); diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h index 3347c54f3a87..3f288f7db2ef 100644 --- a/include/linux/dma-fence.h +++ b/include/linux/dma-fence.h @@ -357,6 +357,18 @@ dma_fence_get_rcu_safe(struct dma_fence __rcu **fencep) } while (1); }
+#ifdef CONFIG_LOCKDEP +bool dma_fence_begin_signalling(void); +void dma_fence_end_signalling(bool cookie); +#else +static inline bool dma_fence_begin_signalling(void) +{ + return true; +} +static inline void dma_fence_end_signalling(bool cookie) {} +static inline void __dma_fence_might_wait(void) {} +#endif + int dma_fence_signal(struct dma_fence *fence); int dma_fence_signal_locked(struct dma_fence *fence); signed long dma_fence_default_wait(struct dma_fence *fence,
On 6/5/20 3:29 PM, Daniel Vetter wrote:
Reviewed-by: Thomas Hellström thomas.hellstrom@intel.com
On 05-06-2020 at 15:29, Daniel Vetter wrote:
As original author of dma-fence, I enjoy seeing more lockdep annotations. Fence was always meant to be cross-driver, so strict driver annotations that can be verified by lockdep are a good thing. Because drivers have to interact with other drivers that use dma-fence, the rules must be the same for everyone, and the above code makes sense.
Reviewed-by: Maarten Lankhorst maarten.lankhorst@linux.intel.com
On 04/06/2020 09:12, Daniel Vetter wrote:
+#ifdef CONFIG_LOCKDEP
+struct lockdep_map dma_fence_lockdep_map = {
+	.name = "dma_fence_map"
+};
Maybe a stupid question because this is definitely complicated, but... if you have a single/static/global lockdep map, doesn't this mean _all_ locks, from _all_ drivers happening to use dma-fences, will get recorded in it? Will this work and not cause false positives?
Sounds like it could create a common link between two completely unconnected usages. Because below you do add annotations to generic dma_fence_signal and dma_fence_wait.
+bool dma_fence_begin_signalling(void)
+{
+	/* explicitly nesting ... */
+	if (lock_is_held_type(&dma_fence_lockdep_map, 1))
+		return true;
+
+	/* rely on might_sleep check for soft/hardirq locks */
+	if (in_atomic())
+		return true;
+
+	/* ... and non-recursive readlock */
+	lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _RET_IP_);
Would it work if the signalling path marked itself as a write lock? I am thinking it would be nice to see in lockdep splats which are signals and which are waits.
The recursive usage wouldn't work then, right? Would a write annotation on the wait path work?
Regards,
Tvrtko
On Wed, Jun 10, 2020 at 4:22 PM Tvrtko Ursulin tvrtko.ursulin@linux.intel.com wrote:
Maybe a stupid question because this is definitely complicated, but.. If you have a single/static/global lockdep map, doesn't this mean _all_ locks, from _all_ drivers happening to use dma-fences will get recorded in it. Will this work and not cause false positives?
Sounds like it could create a common link between two completely unconnected usages. Because below you do add annotations to generic dma_fence_signal and dma_fence_wait.
This is fully intentional. dma-fence is a cross-driver interface, if every driver invents its own rules about how this should work we have an unmaintainable and unreviewable mess.
I've typed up the full length rant already here:
https://lore.kernel.org/dri-devel/CAKMK7uGnFhbpuurRsnZ4dvRV9gQ_3-rmSJaoqSFY=...
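To spell out what the single map means in practice (a rough illustration; driver names, locks and fences below are made up): every annotated signalling section and every dma_fence_wait() is recorded against the same dma_fence_map class, so dependencies from unrelated drivers end up in one graph:

    /* driver A, job completion path */
    cookie = dma_fence_begin_signalling();
    mutex_lock(&a->lock);              /* recorded as dma_fence_map -> a->lock */
    mutex_unlock(&a->lock);
    dma_fence_signal(a->done_fence);
    dma_fence_end_signalling(cookie);

    /* driver B, elsewhere, no fence ever shared with driver A */
    mutex_lock(&b->lock);
    dma_fence_wait(b->fence, false);   /* recorded as b->lock -> dma_fence_map */
    mutex_unlock(&b->lock);

Both drivers' locks are now related through dma_fence_map whether or not they ever share a fence, which is exactly the cross-driver enforcement being argued for here, and also the source of the false-positive worry.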
+/**
+ * dma_fence_begin_signalling - begin a critical DMA fence signalling section
+ *
+ * Drivers should use this to annotate the beginning of any code section
+ * required to eventually complete &dma_fence by calling dma_fence_signal().
+ *
+ * The end of these critical sections are annotated with
+ * dma_fence_end_signalling().
+ *
+ * Returns:
+ *
+ * Opaque cookie needed by the implementation, which needs to be passed to
+ * dma_fence_end_signalling().
+ */
+bool dma_fence_begin_signalling(void)
+{
+	/* explicitly nesting ... */
+	if (lock_is_held_type(&dma_fence_lockdep_map, 1))
+		return true;
+
+	/* rely on might_sleep check for soft/hardirq locks */
+	if (in_atomic())
+		return true;
+
+	/* ... and non-recursive readlock */
+	lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _RET_IP_);
Would it work if signalling path would mark itself as a write lock? I am thinking it would be nice to see in lockdep splats what are signals and what are waits.
Yeah it'd be nice to have a read vs write name for the lock. But we already have this problem for e.g. flush_work(), from which I've stolen this idea. So it's not really new. Essentially look at the backtraces lockdep gives you, and reconstruct the deadlock. I'm hoping that people will notice the special functions on the backtrace, e.g. dma_fence_begin_signalling will be listed as offending function/lock holder, and then read the kerneldoc.
The recursive usage wouldn't work then right? Would write annotation on the wait path work?
Wait path is write annotations already, but yeah annotating the signalling side as write would cause endless amounts of false positives. Also, the recursion is what makes these annotations composable: e.g. what I've done in amdgpu, with annotations in the tdr work in drm/scheduler, annotations in the amdgpu gpu reset code and then also annotations in atomic code, which all nest within each other in some call chains, but not others. Dropping the recursion would break that and make it really awkward to annotate such cases correctly.
And the recursion only works if it's read locks, otherwise lockdep complains if you have inconsistent annotations on the signalling side (which again would make it more or less impossible to annotate the above case fully).
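For illustration, the nesting the cookie enables looks roughly like this (call chain and fence names are made up, only the cookie handling mirrors the patch):

    /* outer section, e.g. opened by the gpu reset / tdr work */
    cookie = dma_fence_begin_signalling();   /* takes the read lock, returns false */
    ...
    /* a helper that is also used standalone, e.g. the atomic commit tail */
    inner = dma_fence_begin_signalling();    /* already inside a section: returns true,
                                              * no second acquisition */
    ...
    dma_fence_end_signalling(inner);         /* inner == true: no-op */
    ...
    dma_fence_signal(fence);
    dma_fence_end_signalling(cookie);        /* cookie == false: releases the read lock */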
Cheers, Daniel
Regards,
Tvrtko
+
+	return false;
+}
+EXPORT_SYMBOL(dma_fence_begin_signalling);
+/**
+ * dma_fence_end_signalling - end a critical DMA fence signalling section
+ *
+ * Closes a critical section annotation opened by dma_fence_begin_signalling().
+ */
+void dma_fence_end_signalling(bool cookie)
+{
+	if (cookie)
+		return;
+
+	lock_release(&dma_fence_lockdep_map, _RET_IP_);
+}
+EXPORT_SYMBOL(dma_fence_end_signalling);
+
+void __dma_fence_might_wait(void)
+{
+	bool tmp;
+
+	tmp = lock_is_held_type(&dma_fence_lockdep_map, 1);
+	if (tmp)
+		lock_release(&dma_fence_lockdep_map, _THIS_IP_);
+	lock_map_acquire(&dma_fence_lockdep_map);
+	lock_map_release(&dma_fence_lockdep_map);
+	if (tmp)
+		lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _THIS_IP_);
+}
+#endif
 
 /**
  * dma_fence_signal_locked - signal completion of a fence
  * @fence: the fence to signal
 
@@ -170,14 +324,19 @@ int dma_fence_signal(struct dma_fence *fence)
 {
 	unsigned long flags;
 	int ret;
+	bool tmp;
 
 	if (!fence)
 		return -EINVAL;
 
+	tmp = dma_fence_begin_signalling();
+
 	spin_lock_irqsave(fence->lock, flags);
 	ret = dma_fence_signal_locked(fence);
 	spin_unlock_irqrestore(fence->lock, flags);
 
+	dma_fence_end_signalling(tmp);
+
 	return ret;
 }
 EXPORT_SYMBOL(dma_fence_signal);
@@ -210,6 +369,8 @@ dma_fence_wait_timeout(struct dma_fence *fence, bool intr, signed long timeout)
 
 	might_sleep();
 
+	__dma_fence_might_wait();
+
 	trace_dma_fence_wait_start(fence);
 	if (fence->ops->wait)
 		ret = fence->ops->wait(fence, intr, timeout);
diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h
index 3347c54f3a87..3f288f7db2ef 100644
--- a/include/linux/dma-fence.h
+++ b/include/linux/dma-fence.h
@@ -357,6 +357,18 @@ dma_fence_get_rcu_safe(struct dma_fence __rcu **fencep)
 	} while (1);
 }
 
+#ifdef CONFIG_LOCKDEP
+bool dma_fence_begin_signalling(void);
+void dma_fence_end_signalling(bool cookie);
+#else
+static inline bool dma_fence_begin_signalling(void)
+{
+	return true;
+}
+static inline void dma_fence_end_signalling(bool cookie) {}
+static inline void __dma_fence_might_wait(void) {}
+#endif
+
 int dma_fence_signal(struct dma_fence *fence);
 int dma_fence_signal_locked(struct dma_fence *fence);
 signed long dma_fence_default_wait(struct dma_fence *fence,
On 10/06/2020 16:17, Daniel Vetter wrote:
On Wed, Jun 10, 2020 at 4:22 PM Tvrtko Ursulin tvrtko.ursulin@linux.intel.com wrote:
On 04/06/2020 09:12, Daniel Vetter wrote:
Design is similar to the lockdep annotations for workers, but with some twists:
We use a read-lock for the execution/worker/completion side, so that this explicit annotation can be more liberally sprinkled around. With read locks lockdep isn't going to complain if the read-side isn't nested the same way under all circumstances, so ABBA deadlocks are ok. Which they are, since this is an annotation only.
We're using non-recursive lockdep read lock mode, since in recursive read lock mode lockdep does not catch read side hazards. And we _very_ much want read side hazards to be caught. For full details of this limitation see
commit e91498589746065e3ae95d9a00b068e525eec34f Author: Peter Zijlstra peterz@infradead.org Date: Wed Aug 23 13:13:11 2017 +0200
locking/lockdep/selftests: Add mixed read-write ABBA tests
To allow nesting of the read-side explicit annotations we explicitly keep track of the nesting. lock_is_held() allows us to do that.
The wait-side annotation is a write lock, and entirely done within dma_fence_wait() for everyone by default.
To be able to freely annotate helper functions I want to make it ok to call dma_fence_begin/end_signalling from soft/hardirq context. First attempt was using the hardirq locking context for the write side in lockdep, but this forces all normal spinlocks nested within dma_fence_begin/end_signalling to be irq-safe spinlocks. That's bollocks.
The approach now is to simply check in_atomic(), and for these cases entirely rely on the might_sleep() check in dma_fence_wait(). That will catch any wrong nesting against spinlocks from soft/hardirq contexts.
The idea here is that every code path that's critical for eventually signalling a dma_fence should be annotated with dma_fence_begin/end_signalling. The annotation ideally starts right after a dma_fence is published (added to a dma_resv, exposed as a sync_file fd, attached to a drm_syncobj fd, or anything else that makes the dma_fence visible to other kernel threads), up to and including the dma_fence_wait(). Examples are irq handlers, the scheduler rt threads, the tail of execbuf (after the corresponding fences are visible), any workers that end up signalling dma_fences and really anything else. Not annotated should be code paths that only complete fences opportunistically as the gpu progresses, like e.g. shrinker/eviction code.
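As a concrete illustration of where such an annotation sits (a sketch only; the worker, struct and fence member names below are made up), a driver's job completion worker would look roughly like:

    static void my_job_done_work(struct work_struct *work)
    {
            struct my_job *job = container_of(work, struct my_job, work);
            bool cookie;

            cookie = dma_fence_begin_signalling();
            /* everything here is on the mandatory completion path, so it must not
             * block on reclaim-tainted allocations, dma_resv_lock(), or anything
             * else that can wait on a dma_fence - lockdep will now flag that */
            dma_fence_signal(job->done_fence);
            dma_fence_end_signalling(cookie);
    }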
The main class of deadlocks this is supposed to catch are:
Thread A:
mutex_lock(A); mutex_unlock(A); dma_fence_signal();
Thread B:
mutex_lock(A); dma_fence_wait(); mutex_unlock(A);
Thread B is blocked on A signalling the fence, but A never gets around to that because it cannot acquire the lock A.
Note that dma_fence_wait() is allowed to be nested within dma_fence_begin/end_signalling sections. To allow this to happen the read lock needs to be upgraded to a write lock, which means that if any other lock is acquired between the dma_fence_begin_signalling() call and the call to dma_fence_wait(), and is still held at that point, this will result in an immediate lockdep complaint. The only other option would be to not annotate such calls, defeating the point. Therefore these annotations cannot be sprinkled over the code entirely mindlessly, to avoid false positives.
v2: handle soft/hardirq ctx better against write side and don't forget EXPORT_SYMBOL, drivers can't use this otherwise.
v3: Kerneldoc.
v4: Some spelling fixes from Mika
Cc: Mika Kuoppala mika.kuoppala@intel.com Cc: Thomas Hellstrom thomas.hellstrom@intel.com Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Cc: linux-rdma@vger.kernel.org Cc: amd-gfx@lists.freedesktop.org Cc: intel-gfx@lists.freedesktop.org Cc: Chris Wilson chris@chris-wilson.co.uk Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com
Documentation/driver-api/dma-buf.rst | 12 +- drivers/dma-buf/dma-fence.c | 161 +++++++++++++++++++++++++++ include/linux/dma-fence.h | 12 ++ 3 files changed, 182 insertions(+), 3 deletions(-)
diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst index 63dec76d1d8d..05d856131140 100644 --- a/Documentation/driver-api/dma-buf.rst +++ b/Documentation/driver-api/dma-buf.rst @@ -100,11 +100,11 @@ CPU Access to DMA Buffer Objects .. kernel-doc:: drivers/dma-buf/dma-buf.c :doc: cpu access
-Fence Poll Support -~~~~~~~~~~~~~~~~~~ +Implicit Fence Poll Support +~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. kernel-doc:: drivers/dma-buf/dma-buf.c
- :doc: fence polling
:doc: implicit fence polling
Kernel Functions and Structures Reference
@@ -133,6 +133,12 @@ DMA Fences .. kernel-doc:: drivers/dma-buf/dma-fence.c :doc: DMA fences overview
+DMA Fence Signalling Annotations +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+.. kernel-doc:: drivers/dma-buf/dma-fence.c
- :doc: fence signalling annotation
- DMA Fences Functions Reference
diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c index 656e9ac2d028..0005bc002529 100644 --- a/drivers/dma-buf/dma-fence.c +++ b/drivers/dma-buf/dma-fence.c @@ -110,6 +110,160 @@ u64 dma_fence_context_alloc(unsigned num) } EXPORT_SYMBOL(dma_fence_context_alloc);
+/**
- DOC: fence signalling annotation
- Proving correctness of all the kernel code around &dma_fence through code
- review and testing is tricky for a few reasons:
- It is a cross-driver contract, and therefore all drivers must follow the
- same rules for lock nesting order, calling contexts for various functions
- and anything else significant for in-kernel interfaces. But it is also
- impossible to test all drivers in a single machine, hence brute-force N vs.
- N testing of all combinations is impossible. Even just limiting to the
- possible combinations is infeasible.
- There is an enormous amount of driver code involved. For render drivers
- there's the tail of command submission, after fences are published,
- scheduler code, interrupt and workers to process job completion,
- and timeout, gpu reset and gpu hang recovery code. Plus for integration
- with core mm with have &mmu_notifier, respectively &mmu_interval_notifier,
- and &shrinker. For modesetting drivers there's the commit tail functions
- between when fences for an atomic modeset are published, and when the
- corresponding vblank completes, including any interrupt processing and
- related workers. Auditing all that code, across all drivers, is not
- feasible.
- Due to how many other subsystems are involved and the locking hierarchies
- this pulls in there is extremely thin wiggle-room for driver-specific
- differences. &dma_fence interacts with almost all of the core memory
- handling through page fault handlers via &dma_resv, dma_resv_lock() and
- dma_resv_unlock(). On the other side it also interacts through all
- allocation sites through &mmu_notifier and &shrinker.
- Furthermore lockdep does not handle cross-release dependencies, which means
- any deadlocks between dma_fence_wait() and dma_fence_signal() can't be caught
- at runtime with some quick testing. The simplest example is one thread
- waiting on a &dma_fence while holding a lock::
lock(A);
dma_fence_wait(B);
unlock(A);
- while the other thread is stuck trying to acquire the same lock, which
- prevents it from signalling the fence the previous thread is stuck waiting
- on::
lock(A);
unlock(A);
dma_fence_signal(B);
- By manually annotating all code relevant to signalling a &dma_fence we can
- teach lockdep about these dependencies, which also helps with the validation
- headache since now lockdep can check all the rules for us::
- cookie = dma_fence_begin_signalling();
- lock(A);
- unlock(A);
- dma_fence_signal(B);
- dma_fence_end_signalling(cookie);
- For using dma_fence_begin_signalling() and dma_fence_end_signalling() to
- annotate critical sections the following rules need to be observed:
- All code necessary to complete a &dma_fence must be annotated, from the
- point where a fence is accessible to other threads, to the point where
- dma_fence_signal() is called. Un-annotated code can contain deadlock issues,
- and due to the very strict rules and many corner cases it is infeasible to
- catch these just with review or normal stress testing.
- &struct dma_resv deserves a special note, since the readers are only
- protected by rcu. This means the signalling critical section starts as soon
- as the new fences are installed, even before dma_resv_unlock() is called.
- The only exception are fast paths and opportunistic signalling code, which
- calls dma_fence_signal() purely as an optimization, but is not required to
- guarantee completion of a &dma_fence. The usual example is a wait IOCTL
- which calls dma_fence_signal(), while the mandatory completion path goes
- through a hardware interrupt and possible job completion worker.
- To aid composability of code, the annotations can be freely nested, as long
- as the overall locking hierarchy is consistent. The annotations also work
- both in interrupt and process context. Due to implementation details this
- requires that callers pass an opaque cookie from
- dma_fence_begin_signalling() to dma_fence_end_signalling().
- Validation against the cross driver contract is implemented by priming
- lockdep with the relevant hierarchy at boot-up. This means even just
- testing with a single device is enough to validate a driver, at least as
- far as deadlocks with dma_fence_wait() against dma_fence_signal() are
- concerned.
- */
+#ifdef CONFIG_LOCKDEP +struct lockdep_map dma_fence_lockdep_map = {
.name = "dma_fence_map"
+};
Maybe a stupid question because this is definitely complicated, but.. If you have a single/static/global lockdep map, doesn't this mean _all_ locks, from _all_ drivers happening to use dma-fences will get recorded in it. Will this work and not cause false positives?
Sounds like it could create a common link between two completely unconnected usages. Because below you do add annotations to generic dma_fence_signal and dma_fence_wait.
This is fully intentional. dma-fence is a cross-driver interface, if every driver invents its own rules about how this should work we have an unmaintainable and unreviewable mess.
I've typed up the full length rant already here:
https://lore.kernel.org/dri-devel/CAKMK7uGnFhbpuurRsnZ4dvRV9gQ_3-rmSJaoqSFY=...
But "perfect storm" of:
+ global fence lockmap
+ mmu notifiers
+ fs reclaim
+ default annotations in dma_fence_signal / dma_fence_wait
Equals to anything ever using dma_fence will be in impossible chains with random other drivers, even if neither driver has code to export/share that fence.
Example from the CI run:
[25.918788] Chain exists of:
               fs_reclaim --> mmu_notifier_invalidate_range_start --> dma_fence_map
[25.918794]  Possible unsafe locking scenario:
[25.918797]        CPU0                    CPU1
[25.918799]        ----                    ----
[25.918801]   lock(dma_fence_map);
[25.918803]                                lock(mmu_notifier_invalidate_range_start);
[25.918807]                                lock(dma_fence_map);
[25.918809]   lock(fs_reclaim);
What about a dma_fence_export helper which would "arm" the annotations? It would be called as soon as the fence is exported. Maybe when added to dma_resv, or exported via sync_file, etc. Before that point begin/end_signaling and so would be no-ops.
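Purely to make the suggestion concrete (the helper name, flag bit and semantics below are made up, none of this is in the patch), the idea reads as something like:

    /* hypothetical sketch of the proposal only */
    void dma_fence_export(struct dma_fence *fence)
    {
            /* called when the fence becomes visible outside the driver:
             * installed into a dma_resv, wrapped in a sync_file, ... */
            set_bit(MY_FENCE_FLAG_EXPORTED_BIT, &fence->flags);
    }

with dma_fence_begin/end_signalling and the wait-side annotation staying no-ops until the fence in question has that bit set.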
+/**
- dma_fence_begin_signalling - begin a critical DMA fence signalling section
- Drivers should use this to annotate the beginning of any code section
- required to eventually complete &dma_fence by calling dma_fence_signal().
- The end of these critical sections are annotated with
- dma_fence_end_signalling().
- Returns:
- Opaque cookie needed by the implementation, which needs to be passed to
- dma_fence_end_signalling().
- */
+bool dma_fence_begin_signalling(void) +{
/* explicitly nesting ... */
if (lock_is_held_type(&dma_fence_lockdep_map, 1))
return true;
/* rely on might_sleep check for soft/hardirq locks */
if (in_atomic())
return true;
/* ... and non-recursive readlock */
lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _RET_IP_);
Would it work if signalling path would mark itself as a write lock? I am thinking it would be nice to see in lockdep splats what are signals and what are waits.
Yeah it'd be nice to have a read vs write name for the lock. But we already have this problem for e.g. flush_work(), from which I've stolen this idea. So it's not really new. Essentially look at the backtraces lockdep gives you, and reconstruct the deadlock. I'm hoping that people will notice the special functions on the backtrace, e.g. dma_fence_begin_signalling will be listed as offending function/lock holder, and then read the kerneldoc.
The recursive usage wouldn't work then right? Would write annotation on the wait path work?
Wait path is write annotations already, but yeah annotating the signalling side as write would cause endless amounts of false positives. Also, the recursion is what makes these annotations composable: e.g. what I've done in amdgpu, with annotations in the tdr work in drm/scheduler, annotations in the amdgpu gpu reset code and then also annotations in atomic code, which all nest within each other in some call chains, but not others. Dropping the recursion would break that and make it really awkward to annotate such cases correctly.
And the recursion only works if it's read locks, otherwise lockdep complains if you have inconsistent annotations on the signalling side (which again would make it more or less impossible to annotate the above case fully).
How do I see in lockdep splats if it was a read or write user? Your patch appears to have:
dma_fence_signal:
+       lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _RET_IP_);
__dma_fence_might_wait:
+       lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _THIS_IP_);
Which both seem like read locks. I don't fully understand the lockdep API so I might be wrong, not sure. But neither do I see a difference in the splats telling me which path is which.
Regards,
Tvrtko
On Thu, Jun 11, 2020 at 12:36 PM Tvrtko Ursulin tvrtko.ursulin@linux.intel.com wrote:
On 10/06/2020 16:17, Daniel Vetter wrote:
On Wed, Jun 10, 2020 at 4:22 PM Tvrtko Ursulin tvrtko.ursulin@linux.intel.com wrote:
On 04/06/2020 09:12, Daniel Vetter wrote:
Design is similar to the lockdep annotations for workers, but with some twists:
We use a read-lock for the execution/worker/completion side, so that this explicit annotation can be more liberally sprinkled around. With read locks lockdep isn't going to complain if the read-side isn't nested the same way under all circumstances, so ABBA deadlocks are ok. Which they are, since this is an annotation only.
We're using non-recursive lockdep read lock mode, since in recursive read lock mode lockdep does not catch read side hazards. And we _very_ much want read side hazards to be caught. For full details of this limitation see
commit e91498589746065e3ae95d9a00b068e525eec34f Author: Peter Zijlstra peterz@infradead.org Date: Wed Aug 23 13:13:11 2017 +0200
locking/lockdep/selftests: Add mixed read-write ABBA tests
To allow nesting of the read-side explicit annotations we explicitly keep track of the nesting. lock_is_held() allows us to do that.
The wait-side annotation is a write lock, and entirely done within dma_fence_wait() for everyone by default.
To be able to freely annotate helper functions I want to make it ok to call dma_fence_begin/end_signalling from soft/hardirq context. First attempt was using the hardirq locking context for the write side in lockdep, but this forces all normal spinlocks nested within dma_fence_begin/end_signalling to be irq-safe spinlocks. That's bollocks.
The approach now is to simply check in_atomic(), and for these cases entirely rely on the might_sleep() check in dma_fence_wait(). That will catch any wrong nesting against spinlocks from soft/hardirq contexts.
The idea here is that every code path that's critical for eventually signalling a dma_fence should be annotated with dma_fence_begin/end_signalling. The annotation ideally starts right after a dma_fence is published (added to a dma_resv, exposed as a sync_file fd, attached to a drm_syncobj fd, or anything else that makes the dma_fence visible to other kernel threads), up to and including the dma_fence_wait(). Examples are irq handlers, the scheduler rt threads, the tail of execbuf (after the corresponding fences are visible), any workers that end up signalling dma_fences and really anything else. Not annotated should be code paths that only complete fences opportunistically as the gpu progresses, like e.g. shrinker/eviction code.
The main class of deadlocks this is supposed to catch are:
Thread A:
mutex_lock(A); mutex_unlock(A); dma_fence_signal();
Thread B:
mutex_lock(A); dma_fence_wait(); mutex_unlock(A);
Thread B is blocked on A signalling the fence, but A never gets around to that because it cannot acquire the lock A.
Note that dma_fence_wait() is allowed to be nested within dma_fence_begin/end_signalling sections. To allow this to happen the read lock needs to be upgraded to a write lock, which means that if any other lock is acquired between the dma_fence_begin_signalling() call and the call to dma_fence_wait(), and is still held at that point, this will result in an immediate lockdep complaint. The only other option would be to not annotate such calls, defeating the point. Therefore these annotations cannot be sprinkled over the code entirely mindlessly, to avoid false positives.
v2: handle soft/hardirq ctx better against write side and don't forget EXPORT_SYMBOL, drivers can't use this otherwise.
v3: Kerneldoc.
v4: Some spelling fixes from Mika
Cc: Mika Kuoppala mika.kuoppala@intel.com Cc: Thomas Hellstrom thomas.hellstrom@intel.com Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Cc: linux-rdma@vger.kernel.org Cc: amd-gfx@lists.freedesktop.org Cc: intel-gfx@lists.freedesktop.org Cc: Chris Wilson chris@chris-wilson.co.uk Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com
Documentation/driver-api/dma-buf.rst | 12 +- drivers/dma-buf/dma-fence.c | 161 +++++++++++++++++++++++++++ include/linux/dma-fence.h | 12 ++ 3 files changed, 182 insertions(+), 3 deletions(-)
diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst index 63dec76d1d8d..05d856131140 100644 --- a/Documentation/driver-api/dma-buf.rst +++ b/Documentation/driver-api/dma-buf.rst @@ -100,11 +100,11 @@ CPU Access to DMA Buffer Objects .. kernel-doc:: drivers/dma-buf/dma-buf.c :doc: cpu access
-Fence Poll Support -~~~~~~~~~~~~~~~~~~ +Implicit Fence Poll Support +~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. kernel-doc:: drivers/dma-buf/dma-buf.c
- :doc: fence polling
:doc: implicit fence polling
Kernel Functions and Structures Reference
@@ -133,6 +133,12 @@ DMA Fences .. kernel-doc:: drivers/dma-buf/dma-fence.c :doc: DMA fences overview
+DMA Fence Signalling Annotations +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+.. kernel-doc:: drivers/dma-buf/dma-fence.c
- :doc: fence signalling annotation
- DMA Fences Functions Reference
diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c index 656e9ac2d028..0005bc002529 100644 --- a/drivers/dma-buf/dma-fence.c +++ b/drivers/dma-buf/dma-fence.c @@ -110,6 +110,160 @@ u64 dma_fence_context_alloc(unsigned num) } EXPORT_SYMBOL(dma_fence_context_alloc);
+/**
- DOC: fence signalling annotation
- Proving correctness of all the kernel code around &dma_fence through code
- review and testing is tricky for a few reasons:
- It is a cross-driver contract, and therefore all drivers must follow the
- same rules for lock nesting order, calling contexts for various functions
- and anything else significant for in-kernel interfaces. But it is also
- impossible to test all drivers in a single machine, hence brute-force N vs.
- N testing of all combinations is impossible. Even just limiting to the
- possible combinations is infeasible.
- There is an enormous amount of driver code involved. For render drivers
- there's the tail of command submission, after fences are published,
- scheduler code, interrupt and workers to process job completion,
- and timeout, gpu reset and gpu hang recovery code. Plus for integration
- with core mm with have &mmu_notifier, respectively &mmu_interval_notifier,
- and &shrinker. For modesetting drivers there's the commit tail functions
- between when fences for an atomic modeset are published, and when the
- corresponding vblank completes, including any interrupt processing and
- related workers. Auditing all that code, across all drivers, is not
- feasible.
- Due to how many other subsystems are involved and the locking hierarchies
- this pulls in there is extremely thin wiggle-room for driver-specific
- differences. &dma_fence interacts with almost all of the core memory
- handling through page fault handlers via &dma_resv, dma_resv_lock() and
- dma_resv_unlock(). On the other side it also interacts through all
- allocation sites through &mmu_notifier and &shrinker.
- Furthermore lockdep does not handle cross-release dependencies, which means
- any deadlocks between dma_fence_wait() and dma_fence_signal() can't be caught
- at runtime with some quick testing. The simplest example is one thread
- waiting on a &dma_fence while holding a lock::
lock(A);
dma_fence_wait(B);
unlock(A);
- while the other thread is stuck trying to acquire the same lock, which
- prevents it from signalling the fence the previous thread is stuck waiting
- on::
lock(A);
unlock(A);
dma_fence_signal(B);
- By manually annotating all code relevant to signalling a &dma_fence we can
- teach lockdep about these dependencies, which also helps with the validation
- headache since now lockdep can check all the rules for us::
- cookie = dma_fence_begin_signalling();
- lock(A);
- unlock(A);
- dma_fence_signal(B);
- dma_fence_end_signalling(cookie);
- For using dma_fence_begin_signalling() and dma_fence_end_signalling() to
- annotate critical sections the following rules need to be observed:
- All code necessary to complete a &dma_fence must be annotated, from the
- point where a fence is accessible to other threads, to the point where
- dma_fence_signal() is called. Un-annotated code can contain deadlock issues,
- and due to the very strict rules and many corner cases it is infeasible to
- catch these just with review or normal stress testing.
- &struct dma_resv deserves a special note, since the readers are only
- protected by rcu. This means the signalling critical section starts as soon
- as the new fences are installed, even before dma_resv_unlock() is called.
- The only exception are fast paths and opportunistic signalling code, which
- calls dma_fence_signal() purely as an optimization, but is not required to
- guarantee completion of a &dma_fence. The usual example is a wait IOCTL
- which calls dma_fence_signal(), while the mandatory completion path goes
- through a hardware interrupt and possible job completion worker.
- To aid composability of code, the annotations can be freely nested, as long
- as the overall locking hierarchy is consistent. The annotations also work
- both in interrupt and process context. Due to implementation details this
- requires that callers pass an opaque cookie from
- dma_fence_begin_signalling() to dma_fence_end_signalling().
- Validation against the cross driver contract is implemented by priming
- lockdep with the relevant hierarchy at boot-up. This means even just
- testing with a single device is enough to validate a driver, at least as
- far as deadlocks with dma_fence_wait() against dma_fence_signal() are
- concerned.
- */
+#ifdef CONFIG_LOCKDEP +struct lockdep_map dma_fence_lockdep_map = {
.name = "dma_fence_map"
+};
Maybe a stupid question because this is definitely complicated, but.. If you have a single/static/global lockdep map, doesn't this mean _all_ locks, from _all_ drivers happening to use dma-fences will get recorded in it. Will this work and not cause false positives?
Sounds like it could create a common link between two completely unconnected usages. Because below you do add annotations to generic dma_fence_signal and dma_fence_wait.
This is fully intentional. dma-fence is a cross-driver interface, if every driver invents its own rules about how this should work we have an unmaintainable and unreviewable mess.
I've typed up the full length rant already here:
https://lore.kernel.org/dri-devel/CAKMK7uGnFhbpuurRsnZ4dvRV9gQ_3-rmSJaoqSFY=...
But "perfect storm" of:
- global fence lockmap
- mmu notifiers
- fs reclaim
- default annotations in dma_fence_signal / dma_fence_wait
Equals to anything ever using dma_fence will be in impossible chains with random other drivers, even if neither driver has code to export/share that fence.
Example from the CI run:
[25.918788] Chain exists of:
               fs_reclaim --> mmu_notifier_invalidate_range_start --> dma_fence_map
[25.918794]  Possible unsafe locking scenario:
[25.918797]        CPU0                    CPU1
[25.918799]        ----                    ----
[25.918801]   lock(dma_fence_map);
[25.918803]                                lock(mmu_notifier_invalidate_range_start);
[25.918807]                                lock(dma_fence_map);
[25.918809]   lock(fs_reclaim);
What about a dma_fence_export helper which would "arm" the annotations? It would be called as soon as the fence is exported. Maybe when added to dma_resv, or exported via sync_file, etc. Before that point begin/end_signaling and so would be no-ops.
Run CI without the i915 annotation patch, nothing breaks.
So we can gradually fix up existing code that doesn't quite get it right and move on.
+/**
- dma_fence_begin_signalling - begin a critical DMA fence signalling section
- Drivers should use this to annotate the beginning of any code section
- required to eventually complete &dma_fence by calling dma_fence_signal().
- The end of these critical sections are annotated with
- dma_fence_end_signalling().
- Returns:
- Opaque cookie needed by the implementation, which needs to be passed to
- dma_fence_end_signalling().
- */
+bool dma_fence_begin_signalling(void) +{
/* explicitly nesting ... */
if (lock_is_held_type(&dma_fence_lockdep_map, 1))
return true;
/* rely on might_sleep check for soft/hardirq locks */
if (in_atomic())
return true;
/* ... and non-recursive readlock */
lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _RET_IP_);
Would it work if signalling path would mark itself as a write lock? I am thinking it would be nice to see in lockdep splats what are signals and what are waits.
Yeah it'd be nice to have a read vs write name for the lock. But we already have this problem for e.g. flush_work(), from which I've stolen this idea. So it's not really new. Essentially look at the backtraces lockdep gives you, and reconstruct the deadlock. I'm hoping that people will notice the special functions on the backtrace, e.g. dma_fence_begin_signalling will be listed as offending function/lock holder, and then read the kerneldoc.
The recursive usage wouldn't work then right? Would write annotation on the wait path work?
Wait path is write annotations already, but yeah annotating the signalling side as write would cause endless amounts of false positives. Also, the recursion is what makes these annotations composable: e.g. what I've done in amdgpu, with annotations in the tdr work in drm/scheduler, annotations in the amdgpu gpu reset code and then also annotations in atomic code, which all nest within each other in some call chains, but not others. Dropping the recursion would break that and make it really awkward to annotate such cases correctly.
And the recursion only works if it's read locks, otherwise lockdep complains if you have inconsistent annotations on the signalling side (which again would make it more or less impossible to annotate the above case fully).
How do I see in lockdep splats if it was a read or write user? Your patch appears to have:
dma_fence_signal:
lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _RET_IP_);
__dma_fence_might_wait:
lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _THIS_IP_);
Which both seem like read locks. I don't fully understand the lockdep API so I might be wrong, not sure. But neither do I see a difference in the splats telling me which path is which.
I think you got tricked by the implementation, this isn't quite what's going on. There are two things which make the annotations special:
- we want a recursive read lock on the signalling critical section. The problem is that lockdep doesn't implement full validation for recursive read locks; only non-recursive read/write locks are fully validated. There are some checks for recursive read locks, but exactly the checks we need to catch common dma_fence_wait deadlocks aren't done. That's why we need to implement manual lock recursion on the reader side.
- now on the write side we additionally need to implement a read2write upgrade, and a write2read downgrade. Lockdep doesn't implement that, so again we have to hand-roll this.
Let's go through the code line-by-line:
bool tmp;
tmp = lock_is_held_type(&dma_fence_lockdep_map, 1);
We check whether someone is holding the non-recursive read lock already.
if (tmp) lock_release(&dma_fence_lockdep_map, _THIS_IP_);
If that's the case, we drop that read lock.
lock_map_acquire(&dma_fence_lockdep_map);
Then we do the actual might_wait annotation, the above takes the full write lock ...
lock_map_release(&dma_fence_lockdep_map);
... and now we release the write lock again.
if (tmp) lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _THIS_IP_);
Finally we need to re-acquire the read lock, if we've held that when entering this function. This annotation naturally has to exactly match what begin_signalling would do, otherwise the hand-rolled nesting would fall apart.
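Pulling the lines back together, the helper from the patch reads as follows (the comments are only added here to mirror the walk-through above):

    void __dma_fence_might_wait(void)
    {
            bool tmp;

            /* are we inside an annotated signalling section (read side held)? */
            tmp = lock_is_held_type(&dma_fence_lockdep_map, 1);
            if (tmp)
                    /* temporarily drop the read side ... */
                    lock_release(&dma_fence_lockdep_map, _THIS_IP_);
            /* ... so the wait can be recorded as a full write acquisition ... */
            lock_map_acquire(&dma_fence_lockdep_map);
            lock_map_release(&dma_fence_lockdep_map);
            if (tmp)
                    /* ... and re-take the read side to keep the nesting intact */
                    lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _THIS_IP_);
    }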
I hope that explains what's going on here, and assures you that might_wait() is indeed a write lock annotation, but with a big pile of complications. -Daniel
On 11/06/2020 12:29, Daniel Vetter wrote:
On Thu, Jun 11, 2020 at 12:36 PM Tvrtko Ursulin tvrtko.ursulin@linux.intel.com wrote:
On 10/06/2020 16:17, Daniel Vetter wrote:
On Wed, Jun 10, 2020 at 4:22 PM Tvrtko Ursulin tvrtko.ursulin@linux.intel.com wrote:
On 04/06/2020 09:12, Daniel Vetter wrote:
Design is similar to the lockdep annotations for workers, but with some twists:
We use a read-lock for the execution/worker/completion side, so that this explicit annotation can be more liberally sprinkled around. With read locks lockdep isn't going to complain if the read-side isn't nested the same way under all circumstances, so ABBA deadlocks are ok. Which they are, since this is an annotation only.
We're using non-recursive lockdep read lock mode, since in recursive read lock mode lockdep does not catch read side hazards. And we _very_ much want read side hazards to be caught. For full details of this limitation see
commit e91498589746065e3ae95d9a00b068e525eec34f Author: Peter Zijlstra peterz@infradead.org Date: Wed Aug 23 13:13:11 2017 +0200
locking/lockdep/selftests: Add mixed read-write ABBA tests
To allow nesting of the read-side explicit annotations we explicitly keep track of the nesting. lock_is_held() allows us to do that.
The wait-side annotation is a write lock, and entirely done within dma_fence_wait() for everyone by default.
To be able to freely annotate helper functions I want to make it ok to call dma_fence_begin/end_signalling from soft/hardirq context. First attempt was using the hardirq locking context for the write side in lockdep, but this forces all normal spinlocks nested within dma_fence_begin/end_signalling to be irq-safe spinlocks. That's bollocks.
The approach now is to simply check in_atomic(), and for these cases entirely rely on the might_sleep() check in dma_fence_wait(). That will catch any wrong nesting against spinlocks from soft/hardirq contexts.
The idea here is that every code path that's critical for eventually signalling a dma_fence should be annotated with dma_fence_begin/end_signalling. The annotation ideally starts right after a dma_fence is published (added to a dma_resv, exposed as a sync_file fd, attached to a drm_syncobj fd, or anything else that makes the dma_fence visible to other kernel threads), up to and including the dma_fence_wait(). Examples are irq handlers, the scheduler rt threads, the tail of execbuf (after the corresponding fences are visible), any workers that end up signalling dma_fences and really anything else. Not annotated should be code paths that only complete fences opportunistically as the gpu progresses, like e.g. shrinker/eviction code.
The main class of deadlocks this is supposed to catch are:
Thread A:
mutex_lock(A); mutex_unlock(A); dma_fence_signal();
Thread B:
mutex_lock(A); dma_fence_wait(); mutex_unlock(A);
Thread B is blocked on A signalling the fence, but A never gets around to that because it cannot acquire the lock A.
Note that dma_fence_wait() is allowed to be nested within dma_fence_begin/end_signalling sections. To allow this to happen the read lock needs to be upgraded to a write lock, which means that if any other lock is acquired between the dma_fence_begin_signalling() call and the call to dma_fence_wait(), and is still held at that point, this will result in an immediate lockdep complaint. The only other option would be to not annotate such calls, defeating the point. Therefore these annotations cannot be sprinkled over the code entirely mindlessly, to avoid false positives.
v2: handle soft/hardirq ctx better against write side and don't forget EXPORT_SYMBOL, drivers can't use this otherwise.
v3: Kerneldoc.
v4: Some spelling fixes from Mika
Cc: Mika Kuoppala mika.kuoppala@intel.com Cc: Thomas Hellstrom thomas.hellstrom@intel.com Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Cc: linux-rdma@vger.kernel.org Cc: amd-gfx@lists.freedesktop.org Cc: intel-gfx@lists.freedesktop.org Cc: Chris Wilson chris@chris-wilson.co.uk Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com
Documentation/driver-api/dma-buf.rst | 12 +- drivers/dma-buf/dma-fence.c | 161 +++++++++++++++++++++++++++ include/linux/dma-fence.h | 12 ++ 3 files changed, 182 insertions(+), 3 deletions(-)
diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst index 63dec76d1d8d..05d856131140 100644 --- a/Documentation/driver-api/dma-buf.rst +++ b/Documentation/driver-api/dma-buf.rst @@ -100,11 +100,11 @@ CPU Access to DMA Buffer Objects .. kernel-doc:: drivers/dma-buf/dma-buf.c :doc: cpu access
-Fence Poll Support -~~~~~~~~~~~~~~~~~~ +Implicit Fence Poll Support +~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. kernel-doc:: drivers/dma-buf/dma-buf.c
- :doc: fence polling
:doc: implicit fence polling
Kernel Functions and Structures Reference
@@ -133,6 +133,12 @@ DMA Fences .. kernel-doc:: drivers/dma-buf/dma-fence.c :doc: DMA fences overview
+DMA Fence Signalling Annotations +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+.. kernel-doc:: drivers/dma-buf/dma-fence.c
- :doc: fence signalling annotation
- DMA Fences Functions Reference
diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c index 656e9ac2d028..0005bc002529 100644 --- a/drivers/dma-buf/dma-fence.c +++ b/drivers/dma-buf/dma-fence.c @@ -110,6 +110,160 @@ u64 dma_fence_context_alloc(unsigned num) } EXPORT_SYMBOL(dma_fence_context_alloc);
+/**
- DOC: fence signalling annotation
- Proving correctness of all the kernel code around &dma_fence through code
- review and testing is tricky for a few reasons:
- It is a cross-driver contract, and therefore all drivers must follow the
- same rules for lock nesting order, calling contexts for various functions
- and anything else significant for in-kernel interfaces. But it is also
- impossible to test all drivers in a single machine, hence brute-force N vs.
- N testing of all combinations is impossible. Even just limiting to the
- possible combinations is infeasible.
- There is an enormous amount of driver code involved. For render drivers
- there's the tail of command submission, after fences are published,
- scheduler code, interrupt and workers to process job completion,
- and timeout, gpu reset and gpu hang recovery code. Plus for integration
- with core mm with have &mmu_notifier, respectively &mmu_interval_notifier,
- and &shrinker. For modesetting drivers there's the commit tail functions
- between when fences for an atomic modeset are published, and when the
- corresponding vblank completes, including any interrupt processing and
- related workers. Auditing all that code, across all drivers, is not
- feasible.
- Due to how many other subsystems are involved and the locking hierarchies
- this pulls in there is extremely thin wiggle-room for driver-specific
- differences. &dma_fence interacts with almost all of the core memory
- handling through page fault handlers via &dma_resv, dma_resv_lock() and
- dma_resv_unlock(). On the other side it also interacts through all
- allocation sites through &mmu_notifier and &shrinker.
- Furthermore lockdep does not handle cross-release dependencies, which means
- any deadlocks between dma_fence_wait() and dma_fence_signal() can't be caught
- at runtime with some quick testing. The simplest example is one thread
- waiting on a &dma_fence while holding a lock::
lock(A);
dma_fence_wait(B);
unlock(A);
- while the other thread is stuck trying to acquire the same lock, which
- prevents it from signalling the fence the previous thread is stuck waiting
- on::
lock(A);
unlock(A);
dma_fence_signal(B);
- By manually annotating all code relevant to signalling a &dma_fence we can
- teach lockdep about these dependencies, which also helps with the validation
- headache since now lockdep can check all the rules for us::
- cookie = dma_fence_begin_signalling();
- lock(A);
- unlock(A);
- dma_fence_signal(B);
- dma_fence_end_signalling(cookie);
- For using dma_fence_begin_signalling() and dma_fence_end_signalling() to
- annotate critical sections the following rules need to be observed:
- All code necessary to complete a &dma_fence must be annotated, from the
- point where a fence is accessible to other threads, to the point where
- dma_fence_signal() is called. Un-annotated code can contain deadlock issues,
- and due to the very strict rules and many corner cases it is infeasible to
- catch these just with review or normal stress testing.
- &struct dma_resv deserves a special note, since the readers are only
- protected by rcu. This means the signalling critical section starts as soon
- as the new fences are installed, even before dma_resv_unlock() is called.
- The only exception are fast paths and opportunistic signalling code, which
- calls dma_fence_signal() purely as an optimization, but is not required to
- guarantee completion of a &dma_fence. The usual example is a wait IOCTL
- which calls dma_fence_signal(), while the mandatory completion path goes
- through a hardware interrupt and possible job completion worker.
- To aid composability of code, the annotations can be freely nested, as long
- as the overall locking hierarchy is consistent. The annotations also work
- both in interrupt and process context. Due to implementation details this
- requires that callers pass an opaque cookie from
- dma_fence_begin_signalling() to dma_fence_end_signalling().
- Validation against the cross driver contract is implemented by priming
- lockdep with the relevant hierarchy at boot-up. This means even just
- testing with a single device is enough to validate a driver, at least as
- far as deadlocks with dma_fence_wait() against dma_fence_signal() are
- concerned.
- */
+#ifdef CONFIG_LOCKDEP +struct lockdep_map dma_fence_lockdep_map = {
.name = "dma_fence_map"
+};
Maybe a stupid question because this is definitely complicated, but.. If you have a single/static/global lockdep map, doesn't this mean _all_ locks, from _all_ drivers happening to use dma-fences will get recorded in it. Will this work and not cause false positives?
Sounds like it could create a common link between two completely unconnected usages. Because below you do add annotations to generic dma_fence_signal and dma_fence_wait.
This is fully intentional. dma-fence is a cross-driver interface, if every driver invents its own rules about how this should work we have an unmaintainable and unreviewable mess.
I've typed up the full length rant already here:
https://lore.kernel.org/dri-devel/CAKMK7uGnFhbpuurRsnZ4dvRV9gQ_3-rmSJaoqSFY=...
But "perfect storm" of:
- global fence lockmap
- mmu notifiers
- fs reclaim
- default annotations in dma_fence_signal / dma_fence_wait
Equals to anything ever using dma_fence will be in impossible chains with random other drivers, even if neither driver has code to export/share that fence.
Example from the CI run:
[25.918788] Chain exists of:
               fs_reclaim --> mmu_notifier_invalidate_range_start --> dma_fence_map
[25.918794]  Possible unsafe locking scenario:
[25.918797]        CPU0                    CPU1
[25.918799]        ----                    ----
[25.918801]   lock(dma_fence_map);
[25.918803]                                lock(mmu_notifier_invalidate_range_start);
[25.918807]                                lock(dma_fence_map);
[25.918809]   lock(fs_reclaim);
What about a dma_fence_export helper which would "arm" the annotations? It would be called as soon as the fence is exported. Maybe when added to dma_resv, or exported via sync_file, etc. Before that point begin/end_signaling and so would be no-ops.
Run CI without the i915 annotation patch, nothing breaks.
I think some parts of i915 would still break with my idea to only apply annotations on exported fences. What do you dislike about that idea? I thought the point is to enforce rules for _exported_ fences.
The way you have annotated dma_fence_work, you can't say whether the fence is exported or not. I think it is, btw, so the splats would still be there, but I am not sure it is conceptually correct.
At least my understanding is GFP_KERNEL allocations are only disallowed by the virtue of the global dma-fence contract. If you want to enforce they are never used for anything but exporting, then that would be a bit harsh, no?
Another example from the CI run:
[26.585357]        CPU0                    CPU1
[26.585359]        ----                    ----
[26.585360]   lock(dma_fence_map);
[26.585362]                                lock(mmu_notifier_invalidate_range_start);
[26.585365]                                lock(dma_fence_map);
[26.585367]   lock(i915_gem_object_internal/1);
[26.585369]  *** DEADLOCK ***
Lets say someone submitted an execbuf using userptr as a batch and then unmapped it immediately. That would explain CPU1 getting into the mmu notifier and waiting on this batch to unbind the object.
Meanwhile CPU0 is the async command parser for this request trying to lock the shadow batch buffer. Because it uses the dma_fence_work this is between the begin/end signalling markers.
It can be the same dma-fence I think, since we install the async parser fence on the real batch dma-resv, but dma_fence_map is not a real lock, so what is actually preventing progress in this case?
CPU1 is waiting on a fence, but CPU0 can obtain the lock(i915_gem_object_internal/1), proceed to parse the batch, and exit the signalling section. At which point CPU1 is still blocked, waiting until the execbuf finishes and then mmu notifier can finish and invalidate the pages.
Maybe I am missing something but I don't see how this one is real.
So we can gradually fix up existing code that doesn't quite get it right and move on.
+/**
- dma_fence_begin_signalling - begin a critical DMA fence signalling section
- Drivers should use this to annotate the beginning of any code section
- required to eventually complete &dma_fence by calling dma_fence_signal().
- The end of these critical sections are annotated with
- dma_fence_end_signalling().
- Returns:
- Opaque cookie needed by the implementation, which needs to be passed to
- dma_fence_end_signalling().
- */
+bool dma_fence_begin_signalling(void) +{
/* explicitly nesting ... */
if (lock_is_held_type(&dma_fence_lockdep_map, 1))
return true;
/* rely on might_sleep check for soft/hardirq locks */
if (in_atomic())
return true;
/* ... and non-recursive readlock */
lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _RET_IP_);
Would it work if signalling path would mark itself as a write lock? I am thinking it would be nice to see in lockdep splats what are signals and what are waits.
Yeah it'd be nice to have a read vs write name for the lock. But we already have this problem for e.g. flush_work(), from which I've stolen this idea. So it's not really new. Essentially look at the backtraces lockdep gives you, and reconstruct the deadlock. I'm hoping that people will notice the special functions on the backtrace, e.g. dma_fence_begin_signalling will be listed as offending function/lock holder, and then read the kerneldoc.
The recursive usage wouldn't work then right? Would write annotation on the wait path work?
Wait path is write annotations already, but yeah annotating the signalling side as write would cause endless amounts of false positives. Also, the recursion is what makes these annotations composable: e.g. what I've done in amdgpu, with annotations in the tdr work in drm/scheduler, annotations in the amdgpu gpu reset code and then also annotations in atomic code, which all nest within each other in some call chains, but not others. Dropping the recursion would break that and make it really awkward to annotate such cases correctly.
And the recursion only works if it's read locks, otherwise lockdep complains if you have inconsistent annotations on the signalling side (which again would make it more or less impossible to annotate the above case fully).
How do I see in lockdep splats if it was a read or write user? Your patch appears to have:
dma_fence_signal:
lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _RET_IP_);
__dma_fence_might_wait:
lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _THIS_IP_);
Which both seem like read lock. I don't fully understand the lockdep API so I might be wrong, not sure. But neither I see a difference in splats telling me which path is which.
I think you got tricked by the implementation, this isn't quite what's going on. There are two things which make the annotations special:
- we want a recursive read lock on the signalling critical section. The problem is that lockdep doesn't implement full validation for recursive read locks; only non-recursive read/write locks are fully validated. There are some checks for recursive read locks, but exactly the checks we need to catch common dma_fence_wait deadlocks aren't done. That's why we need to implement manual lock recursion on the reader side.
- now on the write side we additionally need to implement a read2write upgrade, and a write2read downgrade. Lockdep doesn't implement that, so again we have to hand-roll this.
Let's go through the code line-by-line:
bool tmp; tmp = lock_is_held_type(&dma_fence_lockdep_map, 1);
We check whether someone is holding the non-recursive read lock already.
if (tmp) lock_release(&dma_fence_lockdep_map, _THIS_IP_);
If that's the case, we drop that read lock.
lock_map_acquire(&dma_fence_lockdep_map);
Then we do the actual might_wait annotation, the above takes the full write lock ...
lock_map_release(&dma_fence_lockdep_map);
... and now we release the write lock again.
if (tmp) lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _THIS_IP_);
Finally we need to re-acquire the read lock, if we've held that when entering this function. This annotation naturally has to exactly match what begin_signalling would do, otherwise the hand-rolled nesting would fall apart.
I hope that explains what's going on here, and assures you that might_wait() is indeed a write lock annotation, but with a big pile of complications.
I am certainly confused by the difference between lock_map_acquire/release and lock_acquire/release. What is the difference between the two?
Regards,
Tvrtko
On Thu, Jun 11, 2020 at 4:29 PM Tvrtko Ursulin tvrtko.ursulin@linux.intel.com wrote:
On 11/06/2020 12:29, Daniel Vetter wrote:
On Thu, Jun 11, 2020 at 12:36 PM Tvrtko Ursulin tvrtko.ursulin@linux.intel.com wrote:
On 10/06/2020 16:17, Daniel Vetter wrote:
On Wed, Jun 10, 2020 at 4:22 PM Tvrtko Ursulin tvrtko.ursulin@linux.intel.com wrote:
On 04/06/2020 09:12, Daniel Vetter wrote:
Design is similar to the lockdep annotations for workers, but with some twists:
We use a read-lock for the execution/worker/completion side, so that this explicit annotation can be more liberally sprinkled around. With read locks lockdep isn't going to complain if the read-side isn't nested the same way under all circumstances, so ABBA deadlocks are ok. Which they are, since this is an annotation only.
We're using non-recursive lockdep read lock mode, since in recursive read lock mode lockdep does not catch read side hazards. And we _very_ much want read side hazards to be caught. For full details of this limitation see
commit e91498589746065e3ae95d9a00b068e525eec34f Author: Peter Zijlstra peterz@infradead.org Date: Wed Aug 23 13:13:11 2017 +0200
locking/lockdep/selftests: Add mixed read-write ABBA tests
To allow nesting of the read-side explicit annotations we explicitly keep track of the nesting. lock_is_held() allows us to do that.
The wait-side annotation is a write lock, and entirely done within dma_fence_wait() for everyone by default.
To be able to freely annotate helper functions I want to make it ok to call dma_fence_begin/end_signalling from soft/hardirq context. First attempt was using the hardirq locking context for the write side in lockdep, but this forces all normal spinlocks nested within dma_fence_begin/end_signalling to be hardirq-safe spinlocks. That's bollocks.
The approach now is to simply check in_atomic(), and for these cases entirely rely on the might_sleep() check in dma_fence_wait(). That will catch any wrong nesting against spinlocks from soft/hardirq contexts.
The idea here is that every code path that's critical for eventually signalling a dma_fence should be annotated with dma_fence_begin/end_signalling. The annotation ideally starts right after a dma_fence is published (added to a dma_resv, exposed as a sync_file fd, attached to a drm_syncobj fd, or anything else that makes the dma_fence visible to other kernel threads), up to and including the dma_fence_wait(). Examples are irq handlers, the scheduler rt threads, the tail of execbuf (after the corresponding fences are visible), any workers that end up signalling dma_fences and really anything else. Not annotated should be code paths that only complete fences opportunistically as the gpu progresses, like e.g. shrinker/eviction code.
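To illustrate where such an annotation would sit, a job-completion worker in a hypothetical driver might look roughly like this (my_job, done_work, done_fence and the cleanup helper are made-up names, purely for illustration):

/* Hypothetical driver worker, annotated as a signalling critical section. */
static void my_driver_job_done_work(struct work_struct *work)
{
	struct my_job *job = container_of(work, struct my_job, done_work);
	bool cookie;

	/* everything from here until dma_fence_signal() is required to
	 * complete job->done_fence, so it must be annotated */
	cookie = dma_fence_begin_signalling();

	my_driver_cleanup_hw_state(job);

	dma_fence_signal(job->done_fence);
	dma_fence_end_signalling(cookie);
}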
The main class of deadlocks this is supposed to catch are:
Thread A:
mutex_lock(A); mutex_unlock(A); dma_fence_signal();
Thread B:
mutex_lock(A); dma_fence_wait(); mutex_unlock(A);
Thread B is blocked on A signalling the fence, but A never gets around to that because it cannot acquire the lock A.
Note that dma_fence_wait() is allowed to be nested within dma_fence_begin/end_signalling sections. To allow this to happen the read lock needs to be upgraded to a write lock, which means that if any other lock is acquired between the dma_fence_begin_signalling() call and the call to dma_fence_wait(), and is still held, this will result in an immediate lockdep complaint. The only other option would be to not annotate such calls, defeating the point. Therefore these annotations cannot be sprinkled over the code entirely mindlessly to avoid false positives.
v2: handle soft/hardirq ctx better against write side and don't forget EXPORT_SYMBOL, drivers can't use this otherwise.
v3: Kerneldoc.
v4: Some spelling fixes from Mika
Cc: Mika Kuoppala mika.kuoppala@intel.com Cc: Thomas Hellstrom thomas.hellstrom@intel.com Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Cc: linux-rdma@vger.kernel.org Cc: amd-gfx@lists.freedesktop.org Cc: intel-gfx@lists.freedesktop.org Cc: Chris Wilson chris@chris-wilson.co.uk Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com
 Documentation/driver-api/dma-buf.rst |  12 +-
 drivers/dma-buf/dma-fence.c          | 161 +++++++++++++++++++++++++++
 include/linux/dma-fence.h            |  12 ++
 3 files changed, 182 insertions(+), 3 deletions(-)
diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst
index 63dec76d1d8d..05d856131140 100644
--- a/Documentation/driver-api/dma-buf.rst
+++ b/Documentation/driver-api/dma-buf.rst
@@ -100,11 +100,11 @@ CPU Access to DMA Buffer Objects
 .. kernel-doc:: drivers/dma-buf/dma-buf.c
    :doc: cpu access

-Fence Poll Support
-~~~~~~~~~~~~~~~~~~
+Implicit Fence Poll Support
+~~~~~~~~~~~~~~~~~~~~~~~~~~~

 .. kernel-doc:: drivers/dma-buf/dma-buf.c
-   :doc: fence polling
+   :doc: implicit fence polling

 Kernel Functions and Structures Reference
@@ -133,6 +133,12 @@ DMA Fences
 .. kernel-doc:: drivers/dma-buf/dma-fence.c
    :doc: DMA fences overview

+DMA Fence Signalling Annotations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. kernel-doc:: drivers/dma-buf/dma-fence.c
+   :doc: fence signalling annotation
+
 DMA Fences Functions Reference
diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
index 656e9ac2d028..0005bc002529 100644
--- a/drivers/dma-buf/dma-fence.c
+++ b/drivers/dma-buf/dma-fence.c
@@ -110,6 +110,160 @@ u64 dma_fence_context_alloc(unsigned num)
 }
 EXPORT_SYMBOL(dma_fence_context_alloc);

+/**
+ * DOC: fence signalling annotation
+ *
+ * Proving correctness of all the kernel code around &dma_fence through code
+ * review and testing is tricky for a few reasons:
+ *
+ * * It is a cross-driver contract, and therefore all drivers must follow the
+ *   same rules for lock nesting order, calling contexts for various functions
+ *   and anything else significant for in-kernel interfaces. But it is also
+ *   impossible to test all drivers in a single machine, hence brute-force
+ *   N vs. N testing of all combinations is impossible. Even just limiting to
+ *   the possible combinations is infeasible.
+ *
+ * * There is an enormous amount of driver code involved. For render drivers
+ *   there's the tail of command submission, after fences are published,
+ *   scheduler code, interrupt and workers to process job completion, and
+ *   timeout, gpu reset and gpu hang recovery code. Plus for integration with
+ *   core mm we have &mmu_notifier, respectively &mmu_interval_notifier, and
+ *   &shrinker. For modesetting drivers there's the commit tail functions
+ *   between when fences for an atomic modeset are published, and when the
+ *   corresponding vblank completes, including any interrupt processing and
+ *   related workers. Auditing all that code, across all drivers, is not
+ *   feasible.
+ *
+ * * Due to how many other subsystems are involved and the locking hierarchies
+ *   this pulls in there is extremely thin wiggle-room for driver-specific
+ *   differences. &dma_fence interacts with almost all of the core memory
+ *   handling through page fault handlers via &dma_resv, dma_resv_lock() and
+ *   dma_resv_unlock(). On the other side it also interacts through all
+ *   allocation sites through &mmu_notifier and &shrinker.
+ *
+ * Furthermore lockdep does not handle cross-release dependencies, which means
+ * any deadlocks between dma_fence_wait() and dma_fence_signal() can't be
+ * caught at runtime with some quick testing. The simplest example is one
+ * thread waiting on a &dma_fence while holding a lock::
+ *
+ *     lock(A);
+ *     dma_fence_wait(B);
+ *     unlock(A);
+ *
+ * while the other thread is stuck trying to acquire the same lock, which
+ * prevents it from signalling the fence the previous thread is stuck waiting
+ * on::
+ *
+ *     lock(A);
+ *     unlock(A);
+ *     dma_fence_signal(B);
+ *
+ * By manually annotating all code relevant to signalling a &dma_fence we can
+ * teach lockdep about these dependencies, which also helps with the validation
+ * headache since now lockdep can check all the rules for us::
+ *
+ *    cookie = dma_fence_begin_signalling();
+ *    lock(A);
+ *    unlock(A);
+ *    dma_fence_signal(B);
+ *    dma_fence_end_signalling(cookie);
+ *
+ * For using dma_fence_begin_signalling() and dma_fence_end_signalling() to
+ * annotate critical sections the following rules need to be observed:
+ *
+ * * All code necessary to complete a &dma_fence must be annotated, from the
+ *   point where a fence is accessible to other threads, to the point where
+ *   dma_fence_signal() is called. Un-annotated code can contain deadlock
+ *   issues, and due to the very strict rules and many corner cases it is
+ *   infeasible to catch these just with review or normal stress testing.
+ *
+ * * &struct dma_resv deserves a special note, since the readers are only
+ *   protected by rcu. This means the signalling critical section starts as
+ *   soon as the new fences are installed, even before dma_resv_unlock() is
+ *   called.
+ *
+ * * The only exceptions are fast paths and opportunistic signalling code,
+ *   which calls dma_fence_signal() purely as an optimization, but is not
+ *   required to guarantee completion of a &dma_fence. The usual example is a
+ *   wait IOCTL which calls dma_fence_signal(), while the mandatory completion
+ *   path goes through a hardware interrupt and possible job completion
+ *   worker.
+ *
+ * * To aid composability of code, the annotations can be freely nested, as
+ *   long as the overall locking hierarchy is consistent. The annotations also
+ *   work both in interrupt and process context. Due to implementation details
+ *   this requires that callers pass an opaque cookie from
+ *   dma_fence_begin_signalling() to dma_fence_end_signalling().
+ *
+ * * Validation against the cross driver contract is implemented by priming
+ *   lockdep with the relevant hierarchy at boot-up. This means even just
+ *   testing with a single device is enough to validate a driver, at least as
+ *   far as deadlocks with dma_fence_wait() against dma_fence_signal() are
+ *   concerned.
+ */
+
+#ifdef CONFIG_LOCKDEP
+struct lockdep_map	dma_fence_lockdep_map = {
+	.name = "dma_fence_map"
+};
Maybe a stupid question because this is definitely complicated, but.. If you have a single/static/global lockdep map, doesn't this mean _all_ locks, from _all_ drivers happening to use dma-fences will get recorded in it. Will this work and not cause false positives?
Sounds like it could create a common link between two completely unconnected usages. Because below you do add annotations to generic dma_fence_signal and dma_fence_wait.
This is fully intentional. dma-fence is a cross-driver interface, if every driver invents its own rules about how this should work we have an unmaintainable and unreviewable mess.
I've typed up the full length rant already here:
https://lore.kernel.org/dri-devel/CAKMK7uGnFhbpuurRsnZ4dvRV9gQ_3-rmSJaoqSFY=...
But "perfect storm" of:
- global fence lockmap
- mmu notifiers
- fs reclaim
- default annotations in dma_fence_signal / dma_fence_wait
adds up to: anything ever using dma_fence will be in impossible chains with random other drivers, even if neither driver has code to export/share that fence.
Example from the CI run:
[25.918788] Chain exists of:
              fs_reclaim --> mmu_notifier_invalidate_range_start --> dma_fence_map
[25.918794]  Possible unsafe locking scenario:
[25.918797]        CPU0                    CPU1
[25.918799]        ----                    ----
[25.918801]   lock(dma_fence_map);
[25.918803]                               lock(mmu_notifier_invalidate_range_start);
[25.918807]                               lock(dma_fence_map);
[25.918809]   lock(fs_reclaim);
What about a dma_fence_export helper which would "arm" the annotations? It would be called as soon as the fence is exported. Maybe when added to dma_resv, or exported via sync_file, etc. Before that point begin/end_signalling and so on would be no-ops.
Run CI without the i915 annotation patch, nothing breaks.
I think some parts of i915 would still break with my idea to only apply annotations on exported fences. What do you dislike about that idea? I thought the point is to enforce rules for _exported_ fences.
dma_fence is a shared concept, this is upstream, drivers are expected to a) use shared concepts and b) use them in a consistent way. If drivers do whatever they feel like then they're not maintainable in the upstream sense of "maintainable even if the vendor walks away". This was the reason why amd had to spend 2 years refactoring from DAL (which used all the helpers they shared with their firmware/windows driver) to DC (which uses all the upstream kms helpers and datastructures directly).
With how you have annotated dma_fence_work you can't say - maybe it is exported, maybe it isn't. I think it is, btw, so splats would still be there, but I am not sure it is conceptually correct.
At least my understanding is GFP_KERNEL allocations are only disallowed by virtue of the global dma-fence contract. If you want to enforce they are never used for anything but exporting, then that would be a bit harsh, no?
Another example from the CI run:
[26.585357]        CPU0                    CPU1
[26.585359]        ----                    ----
[26.585360]   lock(dma_fence_map);
[26.585362]                               lock(mmu_notifier_invalidate_range_start);
[26.585365]                               lock(dma_fence_map);
[26.585367]   lock(i915_gem_object_internal/1);
[26.585369]  *** DEADLOCK ***
So ime the above deadlock summaries tend to be wrong as soon as you have more than 2 locks involved, which is the case here: they only ever show at most 2 threads, with each thread taking only 2 locks in total, and that on its own isn't going to deadlock when more than 2 locks are actually in play.
Personally I just ignore the above deadlock scenario and just always look at all the locks and backtraces lockdep gives me, and then reconstruct the dependency graph by hand myself, including deadlock scenario.
Let's say someone submitted an execbuf using userptr as a batch and then unmapped it immediately. That would explain CPU1 getting into the mmu notifier and waiting on this batch to unbind the object.
Meanwhile CPU0 is the async command parser for this request trying to lock the shadow batch buffer. Because it uses the dma_fence_work this is between the begin/end signalling markers.
It can be the same dma-fence I think, since we install the async parser fence on the real batch dma-resv, but dma_fence_map is not a real lock, so what is actually preventing progress in this case?
CPU1 is waiting on a fence, but CPU0 can obtain the lock(i915_gem_object_internal/1), proceed to parse the batch, and exit the signalling section. At which point CPU1 is still blocked, waiting until the execbuf finishes and then mmu notifier can finish and invalidate the pages.
Maybe I am missing something but I don't see how this one is real.
The above doesn't deadlock, and it also shouldn't result in a lockdep splat. The trouble is when the signalling thread also grabs i915_gem_object_internal/1 somewhere. If you go through the full CI results you see there's more involved (and at least one of the splats is all just lockdep priming and might_lock, so could be an annotation bug on top), and there is indeed a path where we take the driver private lock in more places, and the wrong way round. That's the thing lockdep is complaining about, it's just not making that clear in the summary, because the summary is only ever correct for 2 locks, not if more are involved.
So we can gradually fix up existing code that doesn't quite get it right and move on.
+/**
+ * dma_fence_begin_signalling - begin a critical DMA fence signalling section
+ *
+ * Drivers should use this to annotate the beginning of any code section
+ * required to eventually complete &dma_fence by calling dma_fence_signal().
+ *
+ * The end of these critical sections are annotated with
+ * dma_fence_end_signalling().
+ *
+ * Returns:
+ *
+ * Opaque cookie needed by the implementation, which needs to be passed to
+ * dma_fence_end_signalling().
+ */
+bool dma_fence_begin_signalling(void)
+{
+	/* explicitly nesting ... */
+	if (lock_is_held_type(&dma_fence_lockdep_map, 1))
+		return true;
+
+	/* rely on might_sleep check for soft/hardirq locks */
+	if (in_atomic())
+		return true;
+
+	/* ... and non-recursive readlock */
+	lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _RET_IP_);
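The hunk above is trimmed; for completeness, the matching end-of-section helper from the same patch looks roughly like the following. This is a sketch based on the cookie semantics described above, not a verbatim quote:

void dma_fence_end_signalling(bool cookie)
{
	/* nested sections and atomic contexts never took the read lock,
	 * so there is nothing to drop */
	if (cookie)
		return;

	if (in_atomic())
		return;

	/* drop the non-recursive read lock taken in dma_fence_begin_signalling() */
	lock_release(&dma_fence_lockdep_map, _RET_IP_);
}
EXPORT_SYMBOL(dma_fence_end_signalling);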
Would it work if the signalling path marked itself as a write lock? I am thinking it would be nice to see in lockdep splats what are signals and what are waits.
Yeah it'd be nice to have a read vs write name for the lock. But we already have this problem for e.g. flush_work(), from which I've stolen this idea. So it's not really new. Essentially look at the backtraces lockdep gives you, and reconstruct the deadlock. I'm hoping that people will notice the special functions on the backtrace, e.g. dma_fence_begin_signalling will be listed as offending function/lock holder, and then read the kerneldoc.
The recursive usage wouldn't work then right? Would write annotation on the wait path work?
Wait path is write annotations already, but yeah annotating the signalling side as write would cause endless amounts of false positives. Also it would hurt the composability of these annotations, e.g. what I've done in amdgpu with annotations in the tdr work in drm/scheduler, annotations in the amdgpu gpu reset code and then also annotations in the atomic code, which all nest within each other in some call chains, but not in others. Dropping the recursion would break that and make it really awkward to annotate such cases correctly.
And the recursion only works if it's read locks, otherwise lockdep complains if you have inconsistent annotations on the signalling side (which again would make it more or less impossible to annotate the above case fully).
How do I see in lockdep splats if it was a read or write user? Your patch appears to have:
dma_fence_signal:
lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _RET_IP_);
__dma_fence_might_wait:
lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _THIS_IP_);
Both of which seem like read locks. I don't fully understand the lockdep API so I might be wrong, not sure. But neither do I see a difference in the splats telling me which path is which.
I think you got tricked by the implementation, this isn't quite what's going on. There's two things which make the annotations special:
- we want a recursive read lock on the signalling critical section.
The problem is that lockdep doesn't implement full validation for recursive read locks, only non-recursive read/write locks are fully validated. There are some checks for recursive read locks, but exactly the checks we need to catch common dma_fence_wait deadlocks aren't done. That's why we need to implement manual lock recursion on the reader side.
- now on the write side we additionally need to implement a read2write upgrade and a write2read downgrade. Lockdep doesn't implement that, so again we have to hand-roll this.
Let's go through the code line-by-line:
bool tmp; tmp = lock_is_held_type(&dma_fence_lockdep_map, 1);
We check whether someone is holding the non-recursive read lock already.
if (tmp) lock_release(&dma_fence_lockdep_map, _THIS_IP_);
If that's the case, we drop that read lock.
lock_map_acquire(&dma_fence_lockdep_map);
Then we do the actual might_wait annotation, the above takes the full write lock ...
lock_map_release(&dma_fence_lockdep_map);
... and now we release the write lock again.
if (tmp) lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _THIS_IP_);
Finally we need to re-acquire the read lock, if we've held that when entering this function. This annotation naturally has to exactly match what begin_signalling would do, otherwise the hand-rolled nesting would fall apart.
I hope that explains what's going on here, and assures you that might_wait() is indeed a write lock annotation, but with a big pile of complications.
I am certainly confused by the difference between lock_map_acquire/release and lock_acquire/release. What is the difference between the two?
lock_map_acquire/release is a thin wrapper around lock_acquire/release. This is all lockdep internal, it's a completely undocumented maze, so unfortunately the only option is to really carefully follow all the definitions from the various locking primitives. And then compare with the lockdep self-tests (which use the locking primitives, not the lockdep internals) to see which flag controls which kind of behaviour.
That's at least what I do, and it's horrible. But yeah lockdep doesn't have documentation for this.
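For reference, the relevant wrappers boil down to roughly the following (paraphrased from include/linux/lockdep.h; the exact definitions vary between kernel versions, so treat this as an illustration rather than a reference):

/* lock_map_acquire() is the exclusive (write) flavour: read = 0, check = 1 */
#define lock_acquire_exclusive(l, s, t, n, i)	lock_acquire(l, s, t, 0, 1, n, i)
#define lock_map_acquire(l)			lock_acquire_exclusive(l, 0, 0, NULL, _THIS_IP_)
#define lock_map_release(l)			lock_release(l, _THIS_IP_)

/* whereas the annotations above pass read = 1 explicitly for the read side */
lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _THIS_IP_);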
If you think it's better to open code the lock_map_acquire/release, I guess I can do that. But it's a mess, so I need to carefully retest everything and make sure I've set the right flags and bits - for added fun they also change the argument ordering in some of the wrappers! -Daniel
Quoting Daniel Vetter (2020-06-04 09:12:09)
Design is similar to the lockdep annotations for workers, but with some twists:
We use a read-lock for the execution/worker/completion side, so that this explicit annotation can be more liberally sprinkled around. With read locks lockdep isn't going to complain if the read-side isn't nested the same way under all circumstances, so ABBA deadlocks are ok. Which they are, since this is an annotation only.
We're using non-recursive lockdep read lock mode, since in recursive read lock mode lockdep does not catch read side hazards. And we _very_ much want read side hazards to be caught. For full details of this limitation see
commit e91498589746065e3ae95d9a00b068e525eec34f Author: Peter Zijlstra peterz@infradead.org Date: Wed Aug 23 13:13:11 2017 +0200
locking/lockdep/selftests: Add mixed read-write ABBA tests
To allow nesting of the read-side explicit annotations we explicitly keep track of the nesting. lock_is_held() allows us to do that.
The wait-side annotation is a write lock, and entirely done within dma_fence_wait() for everyone by default.
To be able to freely annotate helper functions I want to make it ok to call dma_fence_begin/end_signalling from soft/hardirq context. First attempt was using the hardirq locking context for the write side in lockdep, but this forces all normal spinlocks nested within dma_fence_begin/end_signalling to be hardirq-safe spinlocks. That's bollocks.
The approach now is to simply check in_atomic(), and for these cases entirely rely on the might_sleep() check in dma_fence_wait(). That will catch any wrong nesting against spinlocks from soft/hardirq contexts.
The idea here is that every code path that's critical for eventually signalling a dma_fence should be annotated with dma_fence_begin/end_signalling. The annotation ideally starts right after a dma_fence is published (added to a dma_resv, exposed as a sync_file fd, attached to a drm_syncobj fd, or anything else that makes the dma_fence visible to other kernel threads), up to and including the dma_fence_wait(). Examples are irq handlers, the scheduler rt threads, the tail of execbuf (after the corresponding fences are visible), any workers that end up signalling dma_fences and really anything else. Not annotated should be code paths that only complete fences opportunistically as the gpu progresses, like e.g. shrinker/eviction code.
The main class of deadlocks this is supposed to catch are:
Thread A:
mutex_lock(A); mutex_unlock(A); dma_fence_signal();
Thread B:
mutex_lock(A); dma_fence_wait(); mutex_unlock(A);
Thread B is blocked on A signalling the fence, but A never gets around to that because it cannot acquire the lock A.
Note that dma_fence_wait() is allowed to be nested within dma_fence_begin/end_signalling sections. To allow this to happen the read lock needs to be upgraded to a write lock, which means that if any other lock is acquired between the dma_fence_begin_signalling() call and the call to dma_fence_wait(), and is still held, this will result in an immediate lockdep complaint. The only other option would be to not annotate such calls, defeating the point. Therefore these annotations cannot be sprinkled over the code entirely mindlessly to avoid false positives.
v2: handle soft/hardirq ctx better against write side and don't forget EXPORT_SYMBOL, drivers can't use this otherwise.
v3: Kerneldoc.
v4: Some spelling fixes from Mika
Cc: Mika Kuoppala mika.kuoppala@intel.com Cc: Thomas Hellstrom thomas.hellstrom@intel.com Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Cc: linux-rdma@vger.kernel.org Cc: amd-gfx@lists.freedesktop.org Cc: intel-gfx@lists.freedesktop.org Cc: Chris Wilson chris@chris-wilson.co.uk Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com
Introducing a global lockmap that cannot capture the rules correctly,
Nacked-by: Chris Wilson chris@chris-wilson.co.uk
-Chris
On Thu, 11 Jun 2020 at 18:01, Chris Wilson chris@chris-wilson.co.uk wrote:
Quoting Daniel Vetter (2020-06-04 09:12:09)
Design is similar to the lockdep annotations for workers, but with some twists:
We use a read-lock for the execution/worker/completion side, so that this explicit annotation can be more liberally sprinkled around. With read locks lockdep isn't going to complain if the read-side isn't nested the same way under all circumstances, so ABBA deadlocks are ok. Which they are, since this is an annotation only.
We're using non-recursive lockdep read lock mode, since in recursive read lock mode lockdep does not catch read side hazards. And we _very_ much want read side hazards to be caught. For full details of this limitation see
commit e91498589746065e3ae95d9a00b068e525eec34f Author: Peter Zijlstra peterz@infradead.org Date: Wed Aug 23 13:13:11 2017 +0200
locking/lockdep/selftests: Add mixed read-write ABBA tests
To allow nesting of the read-side explicit annotations we explicitly keep track of the nesting. lock_is_held() allows us to do that.
The wait-side annotation is a write lock, and entirely done within dma_fence_wait() for everyone by default.
To be able to freely annotate helper functions I want to make it ok to call dma_fence_begin/end_signalling from soft/hardirq context. First attempt was using the hardirq locking context for the write side in lockdep, but this forces all normal spinlocks nested within dma_fence_begin/end_signalling to be hardirq-safe spinlocks. That's bollocks.
The approach now is to simply check in_atomic(), and for these cases entirely rely on the might_sleep() check in dma_fence_wait(). That will catch any wrong nesting against spinlocks from soft/hardirq contexts.
The idea here is that every code path that's critical for eventually signalling a dma_fence should be annotated with dma_fence_begin/end_signalling. The annotation ideally starts right after a dma_fence is published (added to a dma_resv, exposed as a sync_file fd, attached to a drm_syncobj fd, or anything else that makes the dma_fence visible to other kernel threads), up to and including the dma_fence_wait(). Examples are irq handlers, the scheduler rt threads, the tail of execbuf (after the corresponding fences are visible), any workers that end up signalling dma_fences and really anything else. Not annotated should be code paths that only complete fences opportunistically as the gpu progresses, like e.g. shrinker/eviction code.
The main class of deadlocks this is supposed to catch are:
Thread A:
mutex_lock(A); mutex_unlock(A); dma_fence_signal();
Thread B:
mutex_lock(A); dma_fence_wait(); mutex_unlock(A);
Thread B is blocked on A signalling the fence, but A never gets around to that because it cannot acquire the lock A.
Note that dma_fence_wait() is allowed to be nested within dma_fence_begin/end_signalling sections. To allow this to happen the read lock needs to be upgraded to a write lock, which means that if any other lock is acquired between the dma_fence_begin_signalling() call and the call to dma_fence_wait(), and is still held, this will result in an immediate lockdep complaint. The only other option would be to not annotate such calls, defeating the point. Therefore these annotations cannot be sprinkled over the code entirely mindlessly to avoid false positives.
v2: handle soft/hardirq ctx better against write side and don't forget EXPORT_SYMBOL, drivers can't use this otherwise.
v3: Kerneldoc.
v4: Some spelling fixes from Mika
Cc: Mika Kuoppala mika.kuoppala@intel.com Cc: Thomas Hellstrom thomas.hellstrom@intel.com Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Cc: linux-rdma@vger.kernel.org Cc: amd-gfx@lists.freedesktop.org Cc: intel-gfx@lists.freedesktop.org Cc: Chris Wilson chris@chris-wilson.co.uk Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com
Introducing a global lockmap that cannot capture the rules correctly,
Can you document the rules all drivers should be following then, because from here it looks to get refactored every version of i915, and it would be nice if we could all aim for the same set of things roughly. We've already had enough problems with amdgpu vs i915 vs everyone else with fences, if this stops that in the future then I'd rather we have that than just some unwritten rules per driver and untestable.
Dave.
Hi,
On Thu, 11 Jun 2020 at 09:44, Dave Airlie airlied@gmail.com wrote:
On Thu, 11 Jun 2020 at 18:01, Chris Wilson chris@chris-wilson.co.uk wrote:
Introducing a global lockmap that cannot capture the rules correctly,
Can you document the rules all drivers should be following then, because from here it looks to get refactored every version of i915, and it would be nice if we could all aim for the same set of things roughly. We've already had enough problems with amdgpu vs i915 vs everyone else with fences, if this stops that in the future then I'd rather we have that than just some unwritten rules per driver and untestable.
As someone who has sunk a bunch of work into explicit-fencing awareness in my compositor so I can never be blocked, I'd be disappointed if the infrastructure was ultimately pointless because the documented fencing rules were _o_/ or thereabouts. Lockdep definitely isn't my area of expertise so I can't comment on the patch per se, but having something to ensure we don't hit deadlocks sure seems a lot better than nothing.
Cheers, Daniel
Quoting Daniel Stone (2020-06-11 10:01:46)
Hi,
On Thu, 11 Jun 2020 at 09:44, Dave Airlie airlied@gmail.com wrote:
On Thu, 11 Jun 2020 at 18:01, Chris Wilson chris@chris-wilson.co.uk wrote:
Introducing a global lockmap that cannot capture the rules correctly,
Can you document the rules all drivers should be following then, because from here it looks to get refactored every version of i915, and it would be nice if we could all aim for the same set of things roughly. We've already had enough problems with amdgpu vs i915 vs everyone else with fences, if this stops that in the future then I'd rather we have that than just some unwritten rules per driver and untestable.
As someone who has sunk a bunch of work into explicit-fencing awareness in my compositor so I can never be blocked, I'd be disappointed if the infrastructure was ultimately pointless because the documented fencing rules were _o_/ or thereabouts. Lockdep definitely isn't my area of expertise so I can't comment on the patch per se, but having something to ensure we don't hit deadlocks sure seems a lot better than nothing.
This is doing dependency analysis on execution contexts which is a far cry from doing the fence dependency analysis, and so has to actively ignore the cycles that must exist on the dma side, and also the cycles that prevent entering execution contexts on the CPU. It has to actively ignore scheduler execution contexts, for lockdep cries, and so we do not get analysis of the locking contexts along that path. This would be solvable along the lines of extending lockdep ala lockdep_dma_enter().
Had i915's execution flow been marked up, it should have found the dubious wait for external fences inside the dead GPU recovery, and probably found a few more things to complain about with the reset locking. [Note we already do the same annotations for wait-vs-reset, but not reset-vs-execution.]
Determination of which waits are legal and which are not is entirely ad hoc, for there is no status change tracking in the dependency analysis [that is once an execution context is linked to a published fence, again integral to lockdep.] Consider if the completion chain in atomic is swapped out for the morally equivalent fences along intertwined timelines, and so it does a bunch of dma_fence_wait() instead. Why are those waits legal despite them being after we have committed to fulfilling the out fence? [Why are the waits on and for the GPU legal, since they equally block execution flow?]
Forcing a generic primitive to always be part of the same global map is horrible. You forgo being able to use the primitive for unrelated tasks, lose the ability to name particular contexts to gain more informative dependency cycle reports from having the explicit linkage. You can add wait_map tracking without loss of generality [in less than 10 lines], and you can still enforce that all fences used for a common purpose follow the same rules [the simplest way being to default to the singular wait_map]. But it's the explicitly named execution contexts that are the biggest boon to reading the code and reading the lockdep warns.
This is a bunch of ad hoc tracking for a very narrow purpose applied globally, with loss of information. -Chris
On Fri, Jun 19, 2020 at 10:25 AM Chris Wilson chris@chris-wilson.co.uk wrote:
Quoting Daniel Stone (2020-06-11 10:01:46)
Hi,
On Thu, 11 Jun 2020 at 09:44, Dave Airlie airlied@gmail.com wrote:
On Thu, 11 Jun 2020 at 18:01, Chris Wilson chris@chris-wilson.co.uk wrote:
Introducing a global lockmap that cannot capture the rules correctly,
Can you document the rules all drivers should be following then, because from here it looks to get refactored every version of i915, and it would be nice if we could all aim for the same set of things roughly. We've already had enough problems with amdgpu vs i915 vs everyone else with fences, if this stops that in the future then I'd rather we have that than just some unwritten rules per driver and untestable.
As someone who has sunk a bunch of work into explicit-fencing awareness in my compositor so I can never be blocked, I'd be disappointed if the infrastructure was ultimately pointless because the documented fencing rules were _o_/ or thereabouts. Lockdep definitely isn't my area of expertise so I can't comment on the patch per se, but having something to ensure we don't hit deadlocks sure seems a lot better than nothing.
This is doing dependency analysis on execution contexts which is a far cry from doing the fence dependency analysis, and so has to actively ignore the cycles that must exist on the dma side, and also the cycles that prevent entering execution contexts on the CPU. It has to actively ignore scheduler execution contexts, for lockdep cries, and so we do not get analysis of the locking contexts along that path. This would be solvable along the lines of extending lockdep ala lockdep_dma_enter().
drm/scheduler is annotated, found some issues that are rather improbable to hit in practice. But from the quick chat I've had with König and others I think he agrees that it's real at least in the theoretical sense. Probably should consider playing the lottery if you hit it in practice though :-)
Had i915's execution flow been marked up, it should have found the dubious wait for external fences inside the dead GPU recovery, and probably found a few more things to complain about with the reset locking. [Note we already do the same annotations for wait-vs-reset, but not reset-vs-execution.]
I know it splats, that's why the tdr annotation patch comes with a spec proposal for lifting the wait busting we do in i915 to the dma_fence level. I included that because amdgpu has the same problem on modern hw. Apparently their planned fix (because they've hit this bug in testing) was to push some shared lock down into their atomic_comit_tail function and use that in gpu reset, so don't seem that interested in extending dma_fence.
For i915 it's just gen2/3 display, and cross-driver dma-buf/fence usage for those is nil and won't change. Pragmatic solution imo would be to just not annotate gpu reset on these platforms, and relying on our wait busting plus igt tests to make sure it keeps working as-is. The point of the explicit annotations for the signalling side is very much that it can be rolled out gradually, and entirely left out for old legacy paths that aren't worth fixing.
Determination of which waits are legal and which are not is entirely ad hoc, for there is no status change tracking in the dependency analysis [that is once an execution context is linked to a published fence, again integral to lockdep.] Consider if the completion chain in atomic is swapped out for the morally equivalent fences along intertwined timelines, and so it does a bunch of dma_fence_wait() instead. Why are those waits legal despite them being after we have committed to fulfilling the out fence? [Why are the waits on and for the GPU legal, since they equally block execution flow?]
No need to consider, it's already real and resulted in some pretty splats until I got the recursion handling right.
Forcing a generic primitive to always be part of the same global map is horrible. You forgo being able to use the primitive for unrelated tasks, lose the ability to name particular contexts to gain more informative dependency cycle reports from having the explicit linkage. You can add wait_map tracking without loss of generality [in less than 10 lines], and you can still enforce that all fences used for a common purpose follow the same rules [the simplest way being to default to the singular wait_map]. But it's the explicitly named execution contexts that are the biggest boon to reading the code and reading the lockdep warns.
So one thing that's maybe not clear here: This doesn't track the DAG of dependencies. Doesn't even try, I'm still faithfully assuming drivers get that part right. Which is a gap and maybe we should fix this, but not the goal here.
All this does is validate fences against anything else that might be going on in the system. E.g. your recursion example for atomic is handled by just assuming that any dma_fence_wait within a signalling section is legit and correct. We can add this later on, but not with lockdep, since lockdep works with classes. And proving that dma_fences are acyclic requires you to track them all as individuals. Entirely different things.
That still leaves the below:
Forcing a generic primitive to always be part of the same global map is horrible.
And no concrete example or reason for why that's not possible. Because frankly it's not horrible, this is what upstream is all about: Shared concepts, shared contracts, shared code.
The proposed patches might very well encode the wrong contract, that's all up for discussion. But fundamentally questioning that we need one is missing what upstream is all about.
This is a bunch of ad hoc tracking for a very narrow purpose applied globally, with loss of information.
It doesn't solve every problem indeed. I'm happy to review patches to check acyclic-ness of dma-fence at the global level from you, I haven't figured out yet how to make that happen. I know i915-gem has that, but this is about the cross-driver contract here. -Daniel
Quoting Daniel Vetter (2020-06-19 09:51:59)
On Fri, Jun 19, 2020 at 10:25 AM Chris Wilson chris@chris-wilson.co.uk wrote:
Forcing a generic primitive to always be part of the same global map is horrible.
And no concrete example or reason for why that's not possible. Because frankly it's not horrible, this is what upstream is all about: Shared concepts, shared contracts, shared code.
The proposed patches might very well encode the wrong contract, that's all up for discussion. But fundamentally questioning that we need one is missing what upstream is all about.
Then I have not clearly communicated, as my opinion is not that validation is worthless, but that the implementation is enshrining a global property on a low level primitive that prevents it from being used elsewhere. And I want to replace completion [chains] with fences, and bio with fences, and closures with fences, and what other equivalencies there are in the kernel. The fence is as central a locking construct as struct completion and deserves to be a foundational primitive provided by kernel/ used throughout all drivers for discrete problem domains.
This is narrowing dma_fence whereby adding struct lockdep_map *dma_fence::wait_map and annotating linkage, allows you to continue to specify that all dma_fence used for a particular purpose must follow common rules, without restricting the primitive for uses outside of this scope. -Chris
On Fri, Jun 19, 2020 at 10:13:35AM +0100, Chris Wilson wrote:
Quoting Daniel Vetter (2020-06-19 09:51:59)
On Fri, Jun 19, 2020 at 10:25 AM Chris Wilson chris@chris-wilson.co.uk wrote:
Forcing a generic primitive to always be part of the same global map is horrible.
And no concrete example or reason for why that's not possible. Because frankly it's not horrible, this is what upstream is all about: Shared concepts, shared contracts, shared code.
The proposed patches might very well encode the wrong contract, that's all up for discussion. But fundamentally questioning that we need one is missing what upstream is all about.
Then I have not clearly communicated, as my opinion is not that validation is worthless, but that the implementation is enshrining a global property on a low level primitive that prevents it from being used elsewhere. And I want to replace completion [chains] with fences, and bio with fences, and closures with fences, and what other equivalencies there are in the kernel. The fence is as central a locking construct as struct completion and deserves to be a foundational primitive provided by kernel/ used throughout all drivers for discrete problem domains.
This is narrowing dma_fence whereby adding struct lockdep_map *dma_fence::wait_map and annotating linkage, allows you to continue to specify that all dma_fence used for a particular purpose must follow common rules, without restricting the primitive for uses outside of this scope.
Somewhere else in this thread I had discussions with Jason Gunthorpe about this topic. It might maybe change somewhat depending upon exact rules, but his take is very much "I don't want dma_fence in rdma". Or pretty close to that at least.
Similar discussions with habanalabs, they're using dma_fence internally without any of the uapi. Discussion there has also now concluded that it's best if they remove them, and simply switch over to a wait_queue or completion like every other driver does.
The next round of the patches already have a paragraph to at least somewhat limit how non-gpu drivers use dma_fence. And I guess actual consensus might be pointing even more strongly at dma_fence being solely something for gpus and closely related subsystem (maybe media) for syncing dma-buf access.
So dma_fence as general replacement for completion chains I think just wont happen.
What might make sense is if e.g. the lockdep annotations could be reused, at least in design, for wait_queue or completion or anything else really. I do think that has a fair chance compared to the automagic cross-release annotations approach, which relied way too heavily on guessing where barriers are. My experience from just a bit of playing around with these patches here and discussing them with other driver maintainers is that accurately deciding where critical sections start and end is a job for humans only. And if you get it wrong, you will have a false positive.
And you're indeed correct that if we'd do annotations for completions and wait queues, then that would need to have a class per semantically equivalent user, like we have lockdep classes for mutexes, not just one overall.
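To make that concrete, a per-class annotation would mean each semantic group of completions gets its own map instead of sharing dma_fence_map, along the lines of the sketch below (all names made up, purely illustrative of the idea):

/* one lockdep class for one semantic group of completions */
#ifdef CONFIG_LOCKDEP
static struct lockdep_map fw_load_done_map = {
	.name = "fw_load_done"
};
#endif

static void fw_load_wait(struct my_fw *fw)
{
	/* annotate the wait side against this class only, flush_work() style */
	lock_map_acquire(&fw_load_done_map);
	lock_map_release(&fw_load_done_map);

	wait_for_completion(&fw->loaded);
}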
But dma_fence otoh is something very specific, which comes with very specific rules attached - it's not a generic wait_queue at all. Originally it did start out as one even, but it is a very specialized wait_queue.
So there's imo two cases:
- Your completion is entirely orthogonal of dma_fences, and can never ever block a dma_fence. Don't use dma_fence for this, and no problem. It's just another wait_queue somewhere.
- Your completion can eventually, maybe through lots of convolutions and dependencies, block a dma_fence. In that case full dma_fence rules apply, and the only thing you can do with a custom annotation is make the rules even stricter. E.g. if a sub-timeline in the scheduler isn't allowed to take certain scheduler locks. But the userspace visible/published fences do take them, maybe as part of command submission or retirement. Entirely hypothetical, no idea any driver actually needs this.
Cheers, Daniel
Quoting Daniel Vetter (2020-06-19 10:43:09)
On Fri, Jun 19, 2020 at 10:13:35AM +0100, Chris Wilson wrote:
Quoting Daniel Vetter (2020-06-19 09:51:59)
On Fri, Jun 19, 2020 at 10:25 AM Chris Wilson chris@chris-wilson.co.uk wrote:
Forcing a generic primitive to always be part of the same global map is horrible.
And no concrete example or reason for why that's not possible. Because frankly it's not horrible, this is what upstream is all about: Shared concepts, shared contracts, shared code.
The proposed patches might very well encode the wrong contract, that's all up for discussion. But fundamentally questioning that we need one is missing what upstream is all about.
Then I have not clearly communicated, as my opinion is not that validation is worthless, but that the implementation is enshrining a global property on a low level primitive that prevents it from being used elsewhere. And I want to replace completion [chains] with fences, and bio with fences, and closures with fences, and what other equivalencies there are in the kernel. The fence is as central a locking construct as struct completion and deserves to be a foundational primitive provided by kernel/ used throughout all drivers for discrete problem domains.
This is narrowing dma_fence whereby adding struct lockdep_map *dma_fence::wait_map and annotating linkage, allows you to continue to specify that all dma_fence used for a particular purpose must follow common rules, without restricting the primitive for uses outside of this scope.
Somewhere else in this thread I had discussions with Jason Gunthorpe about this topic. It might maybe change somewhat depending upon exact rules, but his take is very much "I don't want dma_fence in rdma". Or pretty close to that at least.
Similar discussions with habanalabs, they're using dma_fence internally without any of the uapi. Discussion there has also now concluded that it's best if they remove them, and simply switch over to a wait_queue or completion like every other driver does.
The next round of the patches already have a paragraph to at least somewhat limit how non-gpu drivers use dma_fence. And I guess actual consensus might be pointing even more strongly at dma_fence being solely something for gpus and closely related subsystem (maybe media) for syncing dma-buf access.
So dma_fence as general replacement for completion chains I think just wont happen.
That is sad. I cannot comprehend going back to pure completions after a taste of fence scheduling. And we are not even close to fully utilising them, as not all the async cpu [allocation!] tasks are fully tracked by fences yet and are still stuck in a FIFO workqueue.
What might make sense is if e.g. the lockdep annotations could be reused, at least in design, for wait_queue or completion or anything else really. I do think that has a fair chance compared to the automagic cross-release annotations approach, which relied way too heavily on guessing where barriers are. My experience from just a bit of playing around with these patches here and discussing them with other driver maintainers is that accurately deciding where critical sections start and end is a job for humans only. And if you get it wrong, you will have a false positive.
And you're indeed correct that if we'd do annotations for completions and wait queues, then that would need to have a class per semantically equivalent user, like we have lockdep classes for mutexes, not just one overall.
But dma_fence otoh is something very specific, which comes with very specific rules attached - it's not a generic wait_queue at all. Originally it did start out as one even, but it is a very specialized wait_queue.
So there's imo two cases:
Your completion is entirely orthogonal of dma_fences, and can never ever block a dma_fence. Don't use dma_fence for this, and no problem. It's just another wait_queue somewhere.
Your completion can eventually, maybe through lots of convolutions and dependencies, block a dma_fence. In that case full dma_fence rules apply, and the only thing you can do with a custom annotation is make the rules even stricter. E.g. if a sub-timeline in the scheduler isn't allowed to take certain scheduler locks. But the userspace visible/published fences do take them, maybe as part of command submission or retirement. Entirely hypothetical, no idea any driver actually needs this.
I think we are faced with this very real problem.
The papering we have today over userptr is so very thin, and if you squint you can already see it is coupled into the completion signal. Just it happens to be on the other side of the fence.
The next batch of priority inversions involve integrating the async cpu tasks into the scheduler, and have full dependency tracking over every internal fence. I do not see any way to avoid coupling the completion signal from the GPU to the earliest resource allocation, as it's an unbroken chain of work, at least from the user's perspective. [Next up for annotations is that we need to always assume that userspace has an implicit lock on GPU resources; having to break that lock with a GPU reset should be a breach of our data integrity, and best avoided, for compute does not care one iota about system integrity and insist userspace knows best.] Such allocations have to be allowed to fail and for that failure to propagate cancelling the queued work, such that I'm considering what rules we need for gfp_t. That might allow enough leverage to break any fs_reclaim loops, but userptr is likely forever doomed [aside from its fs_reclaim loop is as preventable as the normal shrinker paths], but we still need to suggest to pin_user_pages that failure is better than oom and that is not clear atm. Plus the usual failure can happen at any time after updating the user facing bookkeeping, but that is just extra layers in the execution monitor ready to step in and replacing failing work with the error propagation. Or where the system grinds to a halt, requiring the monitor to patch in a new page / resource. -Chris
On Fri, Jun 19, 2020 at 3:12 PM Chris Wilson chris@chris-wilson.co.uk wrote:
Quoting Daniel Vetter (2020-06-19 10:43:09)
On Fri, Jun 19, 2020 at 10:13:35AM +0100, Chris Wilson wrote:
Quoting Daniel Vetter (2020-06-19 09:51:59)
On Fri, Jun 19, 2020 at 10:25 AM Chris Wilson chris@chris-wilson.co.uk wrote:
Forcing a generic primitive to always be part of the same global map is horrible.
And no concrete example or reason for why that's not possible. Because frankly it's not horrible, this is what upstream is all about: Shared concepts, shared contracts, shared code.
The proposed patches might very well encode the wrong contract, that's all up for discussion. But fundamentally questioning that we need one is missing what upstream is all about.
Then I have not clearly communicated, as my opinion is not that validation is worthless, but that the implementation is enshrining a global property on a low level primitive that prevents it from being used elsewhere. And I want to replace completion [chains] with fences, and bio with fences, and closures with fences, and what other equivalencies there are in the kernel. The fence is as central a locking construct as struct completion and deserves to be a foundational primitive provided by kernel/ used throughout all drivers for discrete problem domains.
This is narrowing dma_fence whereby adding struct lockdep_map *dma_fence::wait_map and annotating linkage, allows you to continue to specify that all dma_fence used for a particular purpose must follow common rules, without restricting the primitive for uses outside of this scope.
Somewhere else in this thread I had discussions with Jason Gunthorpe about this topic. It might maybe change somewhat depending upon exact rules, but his take is very much "I don't want dma_fence in rdma". Or pretty close to that at least.
Similar discussions with habanalabs, they're using dma_fence internally without any of the uapi. Discussion there has also now concluded that it's best if they remove them, and simply switch over to a wait_queue or completion like every other driver does.
The next round of the patches already have a paragraph to at least somewhat limit how non-gpu drivers use dma_fence. And I guess actual consensus might be pointing even more strongly at dma_fence being solely something for gpus and closely related subsystem (maybe media) for syncing dma-buf access.
So dma_fence as general replacement for completion chains I think just wont happen.
That is sad. I cannot comprehend going back to pure completions after a taste of fence scheduling. And we are not even close to fully utilising them, as not all the async cpu [allocation!] tasks are fully tracked by fences yet and are still stuck in a FIFO workqueue.
What might make sense is if e.g. the lockdep annotations could be reused, at least in design, for wait_queue or completion or anything else really. I do think that has a fair chance compared to the automagic cross-release annotations approach, which relied way too heavily on guessing where barriers are. My experience from just a bit of playing around with these patches here and discussing them with other driver maintainers is that accurately deciding where critical sections start and end is a job for humans only. And if you get it wrong, you will have a false positive.
And you're indeed correct that if we'd do annotations for completions and wait queues, then that would need to have a class per semantically equivalent user, like we have lockdep classes for mutexes, not just one overall.
But dma_fence otoh is something very specific, which comes with very specific rules attached - it's not a generic wait_queue at all. Originally it did start out as one even, but it is a very specialized wait_queue.
So there's imo two cases:
Your completion is entirely orthogonal of dma_fences, and can never ever block a dma_fence. Don't use dma_fence for this, and no problem. It's just another wait_queue somewhere.
Your completion can eventually, maybe through lots of convolutions and dependencies, block a dma_fence. In that case full dma_fence rules apply, and the only thing you can do with a custom annotation is make the rules even stricter. E.g. if a sub-timeline in the scheduler isn't allowed to take certain scheduler locks. But the userspace visible/published fences do take them, maybe as part of command submission or retirement. Entirely hypothetical, no idea any driver actually needs this.
I think we are faced with this very real problem.
The papering we have today over userptr is so very thin, and if you squint you can already see it is coupled into the completion signal. Just it happens to be on the other side of the fence.
The next batch of priority inversions involve integrating the async cpu tasks into the scheduler, and have full dependency tracking over every internal fence. I do not see any way to avoid coupling the completion signal from the GPU to the earliest resource allocation, as it's an unbroken chain of work, at least from the user's perspective. [Next up for annotations is that we need to always assume that userspace has an implicit lock on GPU resources; having to break that lock with a GPU reset should be a breach of our data integrity, and best avoided, for compute does not care one iota about system integrity and insist userspace knows best.] Such allocations have to be allowed to fail and for that failure to propagate cancelling the queued work, such that I'm considering what rules we need for gfp_t. That might allow enough leverage to break any fs_reclaim loops, but userptr is likely forever doomed [aside from its fs_reclaim loop is as preventable as the normal shrinker paths], but we still need to suggest to pin_user_pages that failure is better than oom and that is not clear atm. Plus the usual failure can happen at any time after updating the user facing bookkeeping, but that is just extra layers in the execution monitor ready to step in and replacing failing work with the error propagation. Or where the system grinds to a halt, requiring the monitor to patch in a new page / resource.
Zooming out a bunch, since this is a lot about the details of making this happen, and I want to make sure I'm understanding your aim correctly. I think we have 2 big things here interacting:
On one side the "everything async" push, for some value of everything. Once everything is async we let either the linux scheduler (for dma_fence_work) or the gpu scheduler (for i915_request) figure out how to order everything, with all the dependencies. For memory allocations there's likely quite a bit of retrying (on the allocation side) and skipping (on the shrinker/mmu notifier side) involved to make this all pan out. Maybe something like a GFP_NOGPU flag.
On the other side we have opinionated userspace with both very long-running batches (they might as well be infinite, best we can do is check that they still preempt within a reasonable amount of time, lack of hw support for preemption in all cases notwithstanding). And batches which synchronize across engines and whatever entirely under userspace controls, with stuff like gpu semaphore waits entirely in the cmd stream, without any kernel or gpu scheduler involvement. Well maybe a slightly smarter gpu scheduler which converts the semaphore wait from a pure busy loop into a "repoll on each scheduler timeslice". But not actual dependency tracking awareness in the kernel (or guc/hw fwiw) of what userspace is really trying to do.
The latter is a big motivator for the former, since with arbitrarily long batches and arbitrary fences any wait for a batch to complete can take forever, hence anything that might end up doing that needs to be done async and without locks. That way we don't have to shoot anything if a batch takes too long.
Finally if anything goes wrong (on the kernel side at least) we just propagate fence error state through the entire ladder of in-flight things (only if it goes wrong terminally ofc).
Roughly correct or did I miss a big (or small but really important) thing?
Thanks, Daniel
Hi, Jumping in after a couple of weeks where I've paged most everything out of my brain ...
On Fri, 19 Jun 2020 at 10:43, Daniel Vetter daniel@ffwll.ch wrote:
On Fri, Jun 19, 2020 at 10:13:35AM +0100, Chris Wilson wrote:
The proposed patches might very well encode the wrong contract, that's all up for discussion. But fundamentally questioning that we need one is missing what upstream is all about.
Then I have not clearly communicated, as my opinion is not that validation is worthless, but that the implementation is enshrining a global property on a low level primitive that prevents it from being used elsewhere. And I want to replace completion [chains] with fences, and bio with fences, and closures with fences, and what other equivalencies there are in the kernel. The fence is as central a locking construct as struct completion and deserves to be a foundational primitive provided by kernel/ used throughout all drivers for discrete problem domains.
This is narrowing dma_fence, whereas adding struct lockdep_map *dma_fence::wait_map and annotating the linkage allows you to continue to specify that all dma_fence used for a particular purpose must follow common rules, without restricting the primitive for uses outside of this scope.
Somewhere else in this thread I had discussions with Jason Gunthorpe about this topic. It might maybe change somewhat depending upon exact rules, but his take is very much "I don't want dma_fence in rdma". Or pretty close to that at least.
Similar discussions with habanalabs, they're using dma_fence internally without any of the uapi. Discussion there has also now concluded that it's best if they remove them, and simply switch over to a wait_queue or completion like every other driver does.
The next round of the patches already have a paragraph to at least somewhat limit how non-gpu drivers use dma_fence. And I guess actual consensus might be pointing even more strongly at dma_fence being solely something for gpus and closely related subsystem (maybe media) for syncing dma-buf access.
So dma_fence as a general replacement for completion chains I think just won't happen.
What might make sense is if e.g. the lockdep annotations could be reused, at least in design, for wait_queue or completion or anything else really. I do think that has a fair chance compared to the automagic cross-release annotations approach, which relied way too heavily on guessing where barriers are. My experience from just a bit of playing around with these patches here and discussing them with other driver maintainers is that accurately deciding where critical sections start and end is a job for humans only. And if you get it wrong, you will have a false positive.
And you're indeed correct that if we'd do annotations for completions and wait queues, then that would need to have a class per semantically equivalent user, like we have lockdep classes for mutexes, not just one overall.
But dma_fence otoh is something very specific, which comes with very specific rules attached - it's not a generic wait_queue at all. Originally it did start out as one even, but it is a very specialized wait_queue.
So there's imo two cases:
Your completion is entirely orthogonal to dma_fences, and can never ever block a dma_fence. Don't use dma_fence for this, and no problem. It's just another wait_queue somewhere.
Your completion can eventually, maybe through lots of convolutions and dependencies, block a dma_fence. In that case full dma_fence rules apply, and the only thing you can do with a custom annotation is make the rules even stricter. E.g. if a sub-timeline in the scheduler isn't allowed to take certain scheduler locks. But the userspace visible/published fences do take them, maybe as part of command submission or retirement. Entirely hypothetical, no idea any driver actually needs this.
I don't claim to understand the implementation of i915's scheduler and GEM handling, and it seems like there's some public context missing here. But to me, the above is a good statement of what I (and a lot of other userspace) have been relying on - that dma-fence is a very tightly scoped thing which is very predictable even in extremis.
It would be great to have something like this enshrined in dma-fence documentation, visible to both kernel and external users. The properties we've so far been assuming for the graphics pipeline - covering production & execution of vertex/fragment workloads on the GPU, framebuffer display, and to the extent this is necessary involving compute - are something like this:
A single dma-fence with no dependencies represents (the tail of) a unit of work, which has been all but committed to the hardware. Once committed to the hardware, this work will complete (successfully or in error) in bounded time. The unit of work referred to by a dma-fence may carry dependencies on other dma-fences, which must of course be subject to the same restrictions as above. No action from any userspace component is required to ensure that the completion occurs.
The cases I know of which legitimately blow holes in this are:

- the work is scheduled but GPU execution resource contention prevents it from completion, e.g. something on a higher-priority context repeatedly gets scheduled in front of it - this is OK because by definition it's what should happen

- the work is scheduled but CPU execution resource contention prevents it from completion, e.g. the DRM scheduler does not get to trigger the hardware to execute the work - this is OK because at this point we have a big system-wide problem

- the work is scheduled but non-execution resource contention prevents it from making progress, e.g. VRAM contention and/or a paging storm - this is OK because again we have a larger problem here and we can't reasonably expect the driver to solve this

- the work is executed but execution does not complete due to the nature of the work, e.g. a chain of work contains a hostile compute shader which does not complete in any reasonable time - this is OK because we require TDR; even without a smart compositor detecting based on fence waits that the work is unsuitable and should not hold up other work, the driver will probably ban the context and lock it out anyway
The first three are general system resource-overload cases, no different from the CPU-side equivalent where it's up to the admin to impose ulimits to prevent forkbombs or runaway memory usage, or up to the user to run fewer Electron apps. The last one is more difficult, because we can't solve the halting problem to know ahead of time that the user has submitted an infinite workload, so we have to live with that as a real hazard and mitigate it where we can (by returning -EIO and killing the app from inside Mesa).
If repurposing dma-fence for non-graphics uses (like general-purpose compute or driver-internal tracking for things other than GPU workloads) makes it more difficult to guarantee the above properties, then I don't want to do it. Maybe the answer is that dma-fence gets split into its core infrastructure which can be used for completion chains, with actual dma-fence being layered above generic completion APIs: other-completion-API can consume fences, but fences _cannot_ consume non-fence things.
This does force a split between graphics (GL/Vulkan/display) workloads and compute (CL/oneAPI/HSA/CUDA), which I get is really difficult to resolve in the driver. But the two are hard split anyway: graphics requires upfront and explicit buffer management, in return dangling the carrot that you can pipeline your workloads and expect completion in reasonable time. General-purpose compute lets you go far more YOLO on resource access, including full userptr SVM, but the flipside is that your execution time might be measured in weeks; as a result you don't get to do execution pipelining because even if you could, it's not a big enough win relative to your execution time to be worth the extra driver and system complexity. I don't think there's a reasonable lowest common denominator between the two that we can try to reuse a generic model for both, because you make too many compromises to try to fit conflicting interests.
In the pre-syncobj days, we did look at what we called 'empty fences' or 'future fences' with the ChromeOS team: a synchronisation object which wasn't backed by a promise of completion as dma-fence is, but instead by the meta-promise (from userspace) of a promise of completion. Ultimately it never became a real thing for the same reason that swsync isn't either; it needed so much special-case handling and so many disclaimers and opt-ins everywhere that by the end, we weren't sure why we were trying to shoehorn it into dma-fence apart from dma-fence already existing - but by removing all its guarantees, we also removed all its usefulness as a primitive.
Cheers, Daniel
On Thu, Jul 09, 2020 at 08:29:21AM +0100, Daniel Stone wrote:
Hi, Jumping in after a couple of weeks where I've paged most everything out of my brain ...
On Fri, 19 Jun 2020 at 10:43, Daniel Vetter daniel@ffwll.ch wrote:
On Fri, Jun 19, 2020 at 10:13:35AM +0100, Chris Wilson wrote:
The proposed patches might very well encode the wrong contract, that's all up for discussion. But fundamentally questioning that we need one is missing what upstream is all about.
Then I have not clearly communicated, as my opinion is not that validation is worthless, but that the implementation is enshrining a global property on a low level primitive that prevents it from being used elsewhere. And I want to replace completion [chains] with fences, and bio with fences, and closures with fences, and what other equivalencies there are in the kernel. The fence is as central a locking construct as struct completion and deserves to be a foundational primitive provided by kernel/ used throughout all drivers for discrete problem domains.
This is narrowing dma_fence, whereas adding struct lockdep_map *dma_fence::wait_map and annotating the linkage allows you to continue to specify that all dma_fence used for a particular purpose must follow common rules, without restricting the primitive for uses outside of this scope.
Somewhere else in this thread I had discussions with Jason Gunthorpe about this topic. It might maybe change somewhat depending upon exact rules, but his take is very much "I don't want dma_fence in rdma". Or pretty close to that at least.
Similar discussions with habanalabs, they're using dma_fence internally without any of the uapi. Discussion there has also now concluded that it's best if they remove them, and simply switch over to a wait_queue or completion like every other driver does.
The next round of the patches already have a paragraph to at least somewhat limit how non-gpu drivers use dma_fence. And I guess actual consensus might be pointing even more strongly at dma_fence being solely something for gpus and closely related subsystem (maybe media) for syncing dma-buf access.
So dma_fence as a general replacement for completion chains I think just won't happen.
What might make sense is if e.g. the lockdep annotations could be reused, at least in design, for wait_queue or completion or anything else really. I do think that has a fair chance compared to the automagic cross-release annotations approach, which relied way too heavily on guessing where barriers are. My experience from just a bit of playing around with these patches here and discussing them with other driver maintainers is that accurately deciding where critical sections start and end is a job for humans only. And if you get it wrong, you will have a false positive.
And you're indeed correct that if we'd do annotations for completions and wait queues, then that would need to have a class per semantically equivalent user, like we have lockdep classes for mutexes, not just one overall.
But dma_fence otoh is something very specific, which comes with very specific rules attached - it's not a generic wait_queue at all. Originally it did start out as one even, but it is a very specialized wait_queue.
So there's imo two cases:
Your completion is entirely orthogonal to dma_fences, and can never ever block a dma_fence. Don't use dma_fence for this, and no problem. It's just another wait_queue somewhere.
Your completion can eventually, maybe through lots of convolutions and dependencies, block a dma_fence. In that case full dma_fence rules apply, and the only thing you can do with a custom annotation is make the rules even stricter. E.g. if a sub-timeline in the scheduler isn't allowed to take certain scheduler locks. But the userspace visible/published fences do take them, maybe as part of command submission or retirement. Entirely hypothetical, no idea any driver actually needs this.
I don't claim to understand the implementation of i915's scheduler and GEM handling, and it seems like there's some public context missing here. But to me, the above is a good statement of what I (and a lot of other userspace) have been relying on - that dma-fence is a very tightly scoped thing which is very predictable even in extremis.
It would be great to have something like this enshrined in dma-fence documentation, visible to both kernel and external users. The properties we've so far been assuming for the graphics pipeline - covering production & execution of vertex/fragment workloads on the GPU, framebuffer display, and to the extent this is necessary involving compute - are something like this:
A single dma-fence with no dependencies represents (the tail of) a unit of work, which has been all but committed to the hardware. Once committed to the hardware, this work will complete (successfully or in error) in bounded time. The unit of work referred to by a dma-fence may carry dependencies on other dma-fences, which must of course be subject to the same restrictions as above. No action from any userspace component is required to ensure that the completion occurs.
The cases I know of which legitimately blow holes in this are:
- the work is scheduled but GPU execution resource contention
prevents it from completion, e.g. something on a higher-priority context repeatedly gets scheduled in front of it - this is OK because by definition it's what should happen
- the work is scheduled but CPU execution resource contention
prevents it from completion, e.g. the DRM scheduler does not get to trigger the hardware to execute the work - this is OK because at this point we have a big system-wide problem
- the work is scheduled but non-execution resource contention
prevents it from making progress, e.g. VRAM contention and/or a paging storm - this is OK because again we have a larger problem here and we can't reasonably expect the driver to solve this
- the work is executed but execution does not complete due to the
nature of the work, e.g. a chain of work contains a hostile compute shader which does not complete in any reasonable time - this is OK because we require TDR; even without a smart compositor detecting based on fence waits that the work is unsuitable and should not hold up other work, the driver will probably ban the context and lock it out anyway
The first three are general system resource-overload cases, no different from the CPU-side equivalent where it's up to the admin to impose ulimits to prevent forkbombs or runaway memory usage, or up to the user to run fewer Electron apps. The last one is more difficult, because we can't solve the halting problem to know ahead of time that the user has submitted an infinite workload, so we have to live with that as a real hazard and mitigate it where we can (by returning -EIO and killing the app from inside Mesa).
If repurposing dma-fence for non-graphics uses (like general-purpose compute or driver-internal tracking for things other than GPU workloads) makes it more difficult to guarantee the above properties, then I don't want to do it. Maybe the answer is that dma-fence gets split into its core infrastructure which can be used for completion chains, with actual dma-fence being layered above generic completion APIs: other-completion-API can consume fences, but fences _cannot_ consume non-fence things.
This does force a split between graphics (GL/Vulkan/display) workloads and compute (CL/oneAPI/HSA/CUDA), which I get is really difficult to resolve in the driver. But the two are hard split anyway: graphics requires upfront and explicit buffer management, in return dangling the carrot that you can pipeline your workloads and expect completion in reasonable time. General-purpose compute lets you go far more YOLO on resource access, including full userptr SVM, but the flipside is that your execution time might be measured in weeks; as a result you don't get to do execution pipelining because even if you could, it's not a big enough win relative to your execution time to be worth the extra driver and system complexity. I don't think there's a reasonable lowest common denominator between the two that we can try to reuse a generic model for both, because you make too many compromises to try to fit conflicting interests.
In the pre-syncobj days, we did look at what we called 'empty fences' or 'future fences' with the ChromeOS team: a synchronisation object which wasn't backed by a promise of completion as dma-fence is, but instead by the meta-promise (from userspace) of a promise of completion. Ultimately it never became a real thing for the same reason that swsync isn't either; it needed so much special-case handling and so many disclaimers and opt-ins everywhere that by the end, we weren't sure why we were trying to shoehorn it into dma-fence apart from dma-fence already existing - but by removing all its guarantees, we also removed all its usefulness as a primitive.
New series has a patch which tries to at least somewhat summarize this entire problem, and why it just doesn't work. Doesn't contain yet the full proposed solution, but maybe that's best for a follow-up patch. Anyway probably best if we poke holes at that text there.
Between the preempt ctx fence in amdgpu and userspace fences or gpu futex or whatever you want to call it, I do think we can make the compute side happy. The sad puppy face comes a bit from vulkan, since vulkan would really like the same execution model, but because it needs to integrate with the overall dma-fence based compositor stack, it can't.
I think even that is solvable, if we have vulkan-based compositors and a completely new set of protocols and uapi from client all the way down to display. That makes it about as bad a flag day as atomic+modifiers.
Also the only reason why the kms driver can then suddenly import a userspace fence, while nothing else in the kernel can allow such dependencies is fairly simple: Framebuffers are pinned, which breaks the dependency loops in the memory manager, and so avoids all the troubles in a slightly different form.
And of course we'd need a timeout in case userspace just screwed up somehow. -Daniel
Cheers, Daniel
Design is similar to the lockdep annotations for workers, but with some twists:
- We use a read-lock for the execution/worker/completion side, so that this explicit annotation can be more liberally sprinkled around. With read locks lockdep isn't going to complain if the read-side isn't nested the same way under all circumstances, so ABBA deadlocks are ok. Which they are, since this is an annotation only.
- We're using non-recursive lockdep read lock mode, since in recursive read lock mode lockdep does not catch read side hazards. And we _very_ much want read side hazards to be caught. For full details of this limitation see
commit e91498589746065e3ae95d9a00b068e525eec34f
Author: Peter Zijlstra peterz@infradead.org
Date:   Wed Aug 23 13:13:11 2017 +0200
locking/lockdep/selftests: Add mixed read-write ABBA tests
- To allow nesting of the read-side explicit annotations we explicitly keep track of the nesting. lock_is_held() allows us to do that.
- The wait-side annotation is a write lock, and entirely done within dma_fence_wait() for everyone by default.
- To be able to freely annotate helper functions I want to make it ok to call dma_fence_begin/end_signalling from soft/hardirq context. First attempt was using the hardirq locking context for the write side in lockdep, but this forces all normal spinlocks nested within dma_fence_begin/end_signalling to be irq-safe spinlocks. That's bollocks.
The approach now is to simply check in_atomic(), and for these cases entirely rely on the might_sleep() check in dma_fence_wait(). That will catch any wrong nesting against spinlocks from soft/hardirq contexts.
The idea here is that every code path that's critical for eventually signalling a dma_fence should be annotated with dma_fence_begin/end_signalling. The annotation ideally starts right after a dma_fence is published (added to a dma_resv, exposed as a sync_file fd, attached to a drm_syncobj fd, or anything else that makes the dma_fence visible to other kernel threads), up to and including the dma_fence_wait(). Examples are irq handlers, the scheduler rt threads, the tail of execbuf (after the corresponding fences are visible), any workers that end up signalling dma_fences and really anything else. Not annotated should be code paths that only complete fences opportunistically as the gpu progresses, like e.g. shrinker/eviction code.
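A minimal sketch of what this annotation placement looks like in practice, for a hypothetical driver's job-completion worker (the driver structures and names below are made up for illustration, only the dma_fence API is real):

#include <linux/dma-fence.h>
#include <linux/workqueue.h>

/* illustrative per-job structure, not from any real driver */
struct my_job {
	struct work_struct done_work;
	struct dma_fence *done_fence;
};

static void my_job_done_worker(struct work_struct *work)
{
	struct my_job *job = container_of(work, struct my_job, done_work);
	bool cookie;

	/*
	 * Everything from here to dma_fence_end_signalling() is in the fence
	 * signalling critical section: no dma_resv_lock(), no GFP_KERNEL
	 * allocations, no waiting on userspace.
	 */
	cookie = dma_fence_begin_signalling();

	/* ... retire hw state, bookkeeping kept allocation-free ... */

	dma_fence_signal(job->done_fence);
	dma_fence_end_signalling(cookie);
}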
The main class of deadlocks this is supposed to catch are:
Thread A:
	mutex_lock(A);
	mutex_unlock(A);
	dma_fence_signal();

Thread B:
	mutex_lock(A);
	dma_fence_wait();
	mutex_unlock(A);
Thread B is blocked on A signalling the fence, but A never gets around to that because it cannot acquire the lock A.
Note that dma_fence_wait() is allowed to be nested within dma_fence_begin/end_signalling sections. To allow this to happen the read lock needs to be upgraded to a write lock, which means that if any other lock is acquired between the dma_fence_begin_signalling() call and the call to dma_fence_wait(), and still held, this will result in an immediate lockdep complaint. The only other option would be to not annotate such calls, defeating the point. Therefore these annotations cannot be sprinkled over the code entirely mindlessly to avoid false positives.
Originally I hoped that the cross-release lockdep extensions would alleviate the need for explicit annotations:
https://lwn.net/Articles/709849/
But there's a few reasons why that's not an option:
- It's not happening in upstream, since it got reverted due to too many false positives:
commit e966eaeeb623f09975ef362c2866fae6f86844f9
Author: Ingo Molnar mingo@kernel.org
Date:   Tue Dec 12 12:31:16 2017 +0100
locking/lockdep: Remove the cross-release locking checks
This code (CONFIG_LOCKDEP_CROSSRELEASE=y and CONFIG_LOCKDEP_COMPLETIONS=y), while it found a number of old bugs initially, was also causing too many false positives that caused people to disable lockdep - which is arguably a worse overall outcome.
- cross-release uses the complete() call to annotate the end of critical sections, for dma_fence that would be dma_fence_signal(). But we do not want all dma_fence_signal() calls to be treated as critical, since many are opportunistic cleanup of gpu requests. If these get stuck there's still the main completion interrupt and workers who can unblock everyone. Automatically annotating all dma_fence_signal() calls would hence cause false positives.
- cross-release had some educated guesses for when a critical section starts, like fresh syscall or fresh work callback. This would again cause false positives without explicit annotations, since for dma_fence the critical sections only starts when we publish a fence.
- Furthermore there can be cases where a thread never does a dma_fence_signal, but is still critical for reaching completion of fences. One example would be a scheduler kthread which picks up jobs and pushes them into hardware, where the interrupt handler or another completion thread calls dma_fence_signal(). But if the scheduler thread hangs, then all the fences hang, hence we need to manually annotate it. cross-release aimed to solve this by chaining cross-release dependencies, but the dependency from scheduler thread to the completion interrupt handler goes through hw where cross-release code can't observe it.
In short, without manual annotations and careful review of the start and end of critical sections, cross-release dependency tracking doesn't work. We need explicit annotations.
v2: handle soft/hardirq ctx better against write side and don't forget EXPORT_SYMBOL, drivers can't use this otherwise.
v3: Kerneldoc.
v4: Some spelling fixes from Mika
v5: Amend commit message to explain in detail why cross-release isn't the solution.
v6: Pull out misplaced .rst hunk.
Reviewed-by: Thomas Hellström thomas.hellstrom@intel.com Reviewed-by: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Mika Kuoppala mika.kuoppala@intel.com Cc: Thomas Hellstrom thomas.hellstrom@intel.com Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Cc: linux-rdma@vger.kernel.org Cc: amd-gfx@lists.freedesktop.org Cc: intel-gfx@lists.freedesktop.org Cc: Chris Wilson chris@chris-wilson.co.uk Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com --- Documentation/driver-api/dma-buf.rst | 6 + drivers/dma-buf/dma-fence.c | 161 +++++++++++++++++++++++++++ include/linux/dma-fence.h | 12 ++ 3 files changed, 179 insertions(+)
diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst index 7fb7b661febd..05d856131140 100644 --- a/Documentation/driver-api/dma-buf.rst +++ b/Documentation/driver-api/dma-buf.rst @@ -133,6 +133,12 @@ DMA Fences .. kernel-doc:: drivers/dma-buf/dma-fence.c :doc: DMA fences overview
+DMA Fence Signalling Annotations +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. kernel-doc:: drivers/dma-buf/dma-fence.c + :doc: fence signalling annotation + DMA Fences Functions Reference ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c index 656e9ac2d028..0005bc002529 100644 --- a/drivers/dma-buf/dma-fence.c +++ b/drivers/dma-buf/dma-fence.c @@ -110,6 +110,160 @@ u64 dma_fence_context_alloc(unsigned num) } EXPORT_SYMBOL(dma_fence_context_alloc);
+/** + * DOC: fence signalling annotation + * + * Proving correctness of all the kernel code around &dma_fence through code + * review and testing is tricky for a few reasons: + * + * * It is a cross-driver contract, and therefore all drivers must follow the + * same rules for lock nesting order, calling contexts for various functions + * and anything else significant for in-kernel interfaces. But it is also + * impossible to test all drivers in a single machine, hence brute-force N vs. + * N testing of all combinations is impossible. Even just limiting to the + * possible combinations is infeasible. + * + * * There is an enormous amount of driver code involved. For render drivers + * there's the tail of command submission, after fences are published, + * scheduler code, interrupt and workers to process job completion, + * and timeout, gpu reset and gpu hang recovery code. Plus for integration + * with core mm with have &mmu_notifier, respectively &mmu_interval_notifier, + * and &shrinker. For modesetting drivers there's the commit tail functions + * between when fences for an atomic modeset are published, and when the + * corresponding vblank completes, including any interrupt processing and + * related workers. Auditing all that code, across all drivers, is not + * feasible. + * + * * Due to how many other subsystems are involved and the locking hierarchies + * this pulls in there is extremely thin wiggle-room for driver-specific + * differences. &dma_fence interacts with almost all of the core memory + * handling through page fault handlers via &dma_resv, dma_resv_lock() and + * dma_resv_unlock(). On the other side it also interacts through all + * allocation sites through &mmu_notifier and &shrinker. + * + * Furthermore lockdep does not handle cross-release dependencies, which means + * any deadlocks between dma_fence_wait() and dma_fence_signal() can't be caught + * at runtime with some quick testing. The simplest example is one thread + * waiting on a &dma_fence while holding a lock:: + * + * lock(A); + * dma_fence_wait(B); + * unlock(A); + * + * while the other thread is stuck trying to acquire the same lock, which + * prevents it from signalling the fence the previous thread is stuck waiting + * on:: + * + * lock(A); + * unlock(A); + * dma_fence_signal(B); + * + * By manually annotating all code relevant to signalling a &dma_fence we can + * teach lockdep about these dependencies, which also helps with the validation + * headache since now lockdep can check all the rules for us:: + * + * cookie = dma_fence_begin_signalling(); + * lock(A); + * unlock(A); + * dma_fence_signal(B); + * dma_fence_end_signalling(cookie); + * + * For using dma_fence_begin_signalling() and dma_fence_end_signalling() to + * annotate critical sections the following rules need to be observed: + * + * * All code necessary to complete a &dma_fence must be annotated, from the + * point where a fence is accessible to other threads, to the point where + * dma_fence_signal() is called. Un-annotated code can contain deadlock issues, + * and due to the very strict rules and many corner cases it is infeasible to + * catch these just with review or normal stress testing. + * + * * &struct dma_resv deserves a special note, since the readers are only + * protected by rcu. This means the signalling critical section starts as soon + * as the new fences are installed, even before dma_resv_unlock() is called. 
+ * + * * The only exception are fast paths and opportunistic signalling code, which + * calls dma_fence_signal() purely as an optimization, but is not required to + * guarantee completion of a &dma_fence. The usual example is a wait IOCTL + * which calls dma_fence_signal(), while the mandatory completion path goes + * through a hardware interrupt and possible job completion worker. + * + * * To aid composability of code, the annotations can be freely nested, as long + * as the overall locking hierarchy is consistent. The annotations also work + * both in interrupt and process context. Due to implementation details this + * requires that callers pass an opaque cookie from + * dma_fence_begin_signalling() to dma_fence_end_signalling(). + * + * * Validation against the cross driver contract is implemented by priming + * lockdep with the relevant hierarchy at boot-up. This means even just + * testing with a single device is enough to validate a driver, at least as + * far as deadlocks with dma_fence_wait() against dma_fence_signal() are + * concerned. + */ +#ifdef CONFIG_LOCKDEP +struct lockdep_map dma_fence_lockdep_map = { + .name = "dma_fence_map" +}; + +/** + * dma_fence_begin_signalling - begin a critical DMA fence signalling section + * + * Drivers should use this to annotate the beginning of any code section + * required to eventually complete &dma_fence by calling dma_fence_signal(). + * + * The end of these critical sections are annotated with + * dma_fence_end_signalling(). + * + * Returns: + * + * Opaque cookie needed by the implementation, which needs to be passed to + * dma_fence_end_signalling(). + */ +bool dma_fence_begin_signalling(void) +{ + /* explicitly nesting ... */ + if (lock_is_held_type(&dma_fence_lockdep_map, 1)) + return true; + + /* rely on might_sleep check for soft/hardirq locks */ + if (in_atomic()) + return true; + + /* ... and non-recursive readlock */ + lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _RET_IP_); + + return false; +} +EXPORT_SYMBOL(dma_fence_begin_signalling); + +/** + * dma_fence_end_signalling - end a critical DMA fence signalling section + * + * Closes a critical section annotation opened by dma_fence_begin_signalling(). + */ +void dma_fence_end_signalling(bool cookie) +{ + if (cookie) + return; + + lock_release(&dma_fence_lockdep_map, _RET_IP_); +} +EXPORT_SYMBOL(dma_fence_end_signalling); + +void __dma_fence_might_wait(void) +{ + bool tmp; + + tmp = lock_is_held_type(&dma_fence_lockdep_map, 1); + if (tmp) + lock_release(&dma_fence_lockdep_map, _THIS_IP_); + lock_map_acquire(&dma_fence_lockdep_map); + lock_map_release(&dma_fence_lockdep_map); + if (tmp) + lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _THIS_IP_); +} +#endif + + /** * dma_fence_signal_locked - signal completion of a fence * @fence: the fence to signal @@ -170,14 +324,19 @@ int dma_fence_signal(struct dma_fence *fence) { unsigned long flags; int ret; + bool tmp;
if (!fence) return -EINVAL;
+ tmp = dma_fence_begin_signalling(); + spin_lock_irqsave(fence->lock, flags); ret = dma_fence_signal_locked(fence); spin_unlock_irqrestore(fence->lock, flags);
+ dma_fence_end_signalling(tmp); + return ret; } EXPORT_SYMBOL(dma_fence_signal); @@ -210,6 +369,8 @@ dma_fence_wait_timeout(struct dma_fence *fence, bool intr, signed long timeout)
might_sleep();
+ __dma_fence_might_wait(); + trace_dma_fence_wait_start(fence); if (fence->ops->wait) ret = fence->ops->wait(fence, intr, timeout); diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h index 3347c54f3a87..3f288f7db2ef 100644 --- a/include/linux/dma-fence.h +++ b/include/linux/dma-fence.h @@ -357,6 +357,18 @@ dma_fence_get_rcu_safe(struct dma_fence __rcu **fencep) } while (1); }
+#ifdef CONFIG_LOCKDEP +bool dma_fence_begin_signalling(void); +void dma_fence_end_signalling(bool cookie); +#else +static inline bool dma_fence_begin_signalling(void) +{ + return true; +} +static inline void dma_fence_end_signalling(bool cookie) {} +static inline void __dma_fence_might_wait(void) {} +#endif + int dma_fence_signal(struct dma_fence *fence); int dma_fence_signal_locked(struct dma_fence *fence); signed long dma_fence_default_wait(struct dma_fence *fence,
Two in one go:

- it is allowed to call dma_fence_wait() while holding a dma_resv_lock(). This is fundamental to how eviction works with ttm, so required.
- it is allowed to call dma_fence_wait() from memory reclaim contexts, specifically from shrinker callbacks (which i915 does), and from mmu notifier callbacks (which amdgpu does, and which i915 sometimes also does, and probably always should, but that's kinda a debate). Also for stuff like HMM we really need to be able to do this, or things get real dicey.
Consequence is that any critical path necessary to get to a dma_fence_signal for a fence must never a) call dma_resv_lock nor b) allocate memory with GFP_KERNEL. Also by implication of dma_resv_lock(), no userspace faulting allowed. That's some supremely obnoxious limitations, which is why we need to sprinkle the right annotations to all relevant paths.
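To illustrate rule a), this is roughly the (allowed) eviction-style pattern that forces the restriction; the buffer object type and surrounding code are a made-up sketch, only the dma_resv/dma_fence calls are real:

#include <linux/dma-fence.h>
#include <linux/dma-resv.h>

/* illustrative buffer object, not a real driver structure */
struct my_bo {
	struct dma_resv *resv;
};

static int my_evict_bo(struct my_bo *bo, struct ww_acquire_ctx *ctx)
{
	struct dma_fence *excl;
	long lret = 0;
	int ret;

	ret = dma_resv_lock(bo->resv, ctx);
	if (ret)
		return ret;

	/*
	 * Waiting on the exclusive fence while holding the reservation is
	 * allowed - which is exactly why the signalling side must never take
	 * dma_resv_lock() or allocate with GFP_KERNEL.
	 */
	excl = dma_fence_get(dma_resv_get_excl(bo->resv));
	if (excl)
		lret = dma_fence_wait(excl, true);
	dma_fence_put(excl);

	/* ... actually unmap and move the backing store ... */

	dma_resv_unlock(bo->resv);
	return lret < 0 ? lret : 0;
}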
The one big locking context we're leaving out here is mmu notifiers, added in
commit 23b68395c7c78a764e8963fc15a7cfd318bf187f
Author: Daniel Vetter daniel.vetter@ffwll.ch
Date:   Mon Aug 26 22:14:21 2019 +0200
mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end
that one covers a lot of other callsites, and it's also allowed to wait on dma-fences from mmu notifiers. But there's no ready-made functions exposed to prime this, so I've left it out for now.
v2: Also track against mmu notifier context.
v3: kerneldoc to spec the cross-driver contract. Note that currently i915 throws in a hard-coded 10s timeout on foreign fences (not sure why that was done, but it's there), which is why that rule is worded with SHOULD instead of MUST.
Also some of the mmu_notifier/shrinker rules might surprise SoC drivers, I haven't fully audited them all. Which is infeasible anyway, we'll need to run them with lockdep and dma-fence annotations and see what goes boom.
v4: A spelling fix from Mika
Cc: Mika Kuoppala mika.kuoppala@intel.com Cc: Thomas Hellstrom thomas.hellstrom@intel.com Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Cc: linux-rdma@vger.kernel.org Cc: amd-gfx@lists.freedesktop.org Cc: intel-gfx@lists.freedesktop.org Cc: Chris Wilson chris@chris-wilson.co.uk Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com --- Documentation/driver-api/dma-buf.rst | 6 ++++ drivers/dma-buf/dma-fence.c | 41 ++++++++++++++++++++++++++++ drivers/dma-buf/dma-resv.c | 4 +++ include/linux/dma-fence.h | 1 + 4 files changed, 52 insertions(+)
diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst index 05d856131140..f8f6decde359 100644 --- a/Documentation/driver-api/dma-buf.rst +++ b/Documentation/driver-api/dma-buf.rst @@ -133,6 +133,12 @@ DMA Fences .. kernel-doc:: drivers/dma-buf/dma-fence.c :doc: DMA fences overview
+DMA Fence Cross-Driver Contract +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. kernel-doc:: drivers/dma-buf/dma-fence.c + :doc: fence cross-driver contract + DMA Fence Signalling Annotations ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c index 0005bc002529..754e6fb84fb7 100644 --- a/drivers/dma-buf/dma-fence.c +++ b/drivers/dma-buf/dma-fence.c @@ -64,6 +64,47 @@ static atomic64_t dma_fence_context_counter = ATOMIC64_INIT(1); * &dma_buf.resv pointer. */
+/** + * DOC: fence cross-driver contract + * + * Since &dma_fence provide a cross driver contract, all drivers must follow the + * same rules: + * + * * Fences must complete in a reasonable time. Fences which represent kernels + * and shaders submitted by userspace, which could run forever, must be backed + * up by timeout and gpu hang recovery code. Minimally that code must prevent + * further command submission and force complete all in-flight fences, e.g. + * when the driver or hardware do not support gpu reset, or if the gpu reset + * failed for some reason. Ideally the driver supports gpu recovery which only + * affects the offending userspace context, and no other userspace + * submissions. + * + * * Drivers may have different ideas of what completion within a reasonable + * time means. Some hang recovery code uses a fixed timeout, others a mix + * between observing forward progress and increasingly strict timeouts. + * Drivers should not try to second guess timeout handling of fences from + * other drivers. + * + * * To ensure there's no deadlocks of dma_fence_wait() against other locks + * drivers should annotate all code required to reach dma_fence_signal(), + * which completes the fences, with dma_fence_begin_signalling() and + * dma_fence_end_signalling(). + * + * * Drivers are allowed to call dma_fence_wait() while holding dma_resv_lock(). + * This means any code required for fence completion cannot acquire a + * &dma_resv lock. Note that this also pulls in the entire established + * locking hierarchy around dma_resv_lock() and dma_resv_unlock(). + * + * * Drivers are allowed to call dma_fence_wait() from their &shrinker + * callbacks. This means any code required for fence completion cannot + * allocate memory with GFP_KERNEL. + * + * * Drivers are allowed to call dma_fence_wait() from their &mmu_notifier + * respectively &mmu_interval_notifier callbacks. This means any code required + * for fence completeion cannot allocate memory with GFP_NOFS or GFP_NOIO. + * Only GFP_ATOMIC is permissible, which might fail. + */ + static const char *dma_fence_stub_get_name(struct dma_fence *fence) { return "stub"; diff --git a/drivers/dma-buf/dma-resv.c b/drivers/dma-buf/dma-resv.c index 99c0a33c918d..c223f32425c4 100644 --- a/drivers/dma-buf/dma-resv.c +++ b/drivers/dma-buf/dma-resv.c @@ -35,6 +35,7 @@ #include <linux/dma-resv.h> #include <linux/export.h> #include <linux/sched/mm.h> +#include <linux/mmu_notifier.h>
/** * DOC: Reservation Object Overview @@ -115,6 +116,9 @@ static int __init dma_resv_lockdep(void) if (ret == -EDEADLK) dma_resv_lock_slow(&obj, &ctx); fs_reclaim_acquire(GFP_KERNEL); + lock_map_acquire(&__mmu_notifier_invalidate_range_start_map); + __dma_fence_might_wait(); + lock_map_release(&__mmu_notifier_invalidate_range_start_map); fs_reclaim_release(GFP_KERNEL); ww_mutex_unlock(&obj.lock); ww_acquire_fini(&ctx); diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h index 3f288f7db2ef..09e23adb351d 100644 --- a/include/linux/dma-fence.h +++ b/include/linux/dma-fence.h @@ -360,6 +360,7 @@ dma_fence_get_rcu_safe(struct dma_fence __rcu **fencep) #ifdef CONFIG_LOCKDEP bool dma_fence_begin_signalling(void); void dma_fence_end_signalling(bool cookie); +void __dma_fence_might_wait(void); #else static inline bool dma_fence_begin_signalling(void) {
On 6/4/20 10:12 AM, Daniel Vetter wrote:
Two in one go:
it is allowed to call dma_fence_wait() while holding a dma_resv_lock(). This is fundamental to how eviction works with ttm, so required.
it is allowed to call dma_fence_wait() from memory reclaim contexts, specifically from shrinker callbacks (which i915 does), and from mmu notifier callbacks (which amdgpu does, and which i915 sometimes also does, and probably always should, but that's kinda a debate). Also for stuff like HMM we really need to be able to do this, or things get real dicey.
Consequence is that any critical path necessary to get to a dma_fence_signal for a fence must never a) call dma_resv_lock nor b) allocate memory with GFP_KERNEL. Also by implication of dma_resv_lock(), no userspace faulting allowed. That's some supremely obnoxious limitations, which is why we need to sprinkle the right annotations to all relevant paths.
The one big locking context we're leaving out here is mmu notifiers, added in
commit 23b68395c7c78a764e8963fc15a7cfd318bf187f
Author: Daniel Vetter daniel.vetter@ffwll.ch
Date:   Mon Aug 26 22:14:21 2019 +0200
mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end
that one covers a lot of other callsites, and it's also allowed to wait on dma-fences from mmu notifiers. But there's no ready-made functions exposed to prime this, so I've left it out for now.
v2: Also track against mmu notifier context.
v3: kerneldoc to spec the cross-driver contract. Note that currently i915 throws in a hard-coded 10s timeout on foreign fences (not sure why that was done, but it's there), which is why that rule is worded with SHOULD instead of MUST.
Also some of the mmu_notifier/shrinker rules might surprise SoC drivers, I haven't fully audited them all. Which is infeasible anyway, we'll need to run them with lockdep and dma-fence annotations and see what goes boom.
v4: A spelling fix from Mika
Cc: Mika Kuoppala mika.kuoppala@intel.com Cc: Thomas Hellstrom thomas.hellstrom@intel.com Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Cc: linux-rdma@vger.kernel.org Cc: amd-gfx@lists.freedesktop.org Cc: intel-gfx@lists.freedesktop.org Cc: Chris Wilson chris@chris-wilson.co.uk Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com
Documentation/driver-api/dma-buf.rst | 6 ++++ drivers/dma-buf/dma-fence.c | 41 ++++++++++++++++++++++++++++ drivers/dma-buf/dma-resv.c | 4 +++ include/linux/dma-fence.h | 1 + 4 files changed, 52 insertions(+)
I still have my doubts about allowing fence waiting from within shrinkers. IMO ideally they should use a trywait approach, in order to allow memory allocation during command submission for drivers that publish fences before command submission. (Since early reservation object release requires that).
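Something like the following is roughly what a trywait-style shrinker could look like, purely as an illustration - the bo type, LRU list, purge helper and locking around the list are all made up or omitted:

#include <linux/dma-resv.h>
#include <linux/list.h>
#include <linux/shrinker.h>

/* illustrative types only */
struct my_bo {
	struct dma_resv *resv;
	struct list_head lru_link;
};

static LIST_HEAD(my_lru);

/* drop the backing store, return the number of pages freed */
static unsigned long my_bo_purge(struct my_bo *bo)
{
	return 0;
}

static unsigned long my_shrinker_scan(struct shrinker *shrinker,
				      struct shrink_control *sc)
{
	struct my_bo *bo, *tmp;
	unsigned long freed = 0;

	/* lru locking omitted for brevity */
	list_for_each_entry_safe(bo, tmp, &my_lru, lru_link) {
		/* never block on fences or the reservation from reclaim */
		if (!dma_resv_trylock(bo->resv))
			continue;

		if (dma_resv_test_signaled_rcu(bo->resv, true))
			freed += my_bo_purge(bo);

		dma_resv_unlock(bo->resv);
		if (freed >= sc->nr_to_scan)
			break;
	}

	return freed;
}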
But since drivers are already waiting from within shrinkers and I take your word for HMM requiring this,
Reviewed-by: Thomas Hellström thomas.hellstrom@intel.com
On Thu, Jun 11, 2020 at 09:30:12AM +0200, Thomas Hellström (Intel) wrote:
On 6/4/20 10:12 AM, Daniel Vetter wrote:
Two in one go:
it is allowed to call dma_fence_wait() while holding a dma_resv_lock(). This is fundamental to how eviction works with ttm, so required.
it is allowed to call dma_fence_wait() from memory reclaim contexts, specifically from shrinker callbacks (which i915 does), and from mmu notifier callbacks (which amdgpu does, and which i915 sometimes also does, and probably always should, but that's kinda a debate). Also for stuff like HMM we really need to be able to do this, or things get real dicey.
Consequence is that any critical path necessary to get to a dma_fence_signal for a fence must never a) call dma_resv_lock nor b) allocate memory with GFP_KERNEL. Also by implication of dma_resv_lock(), no userspace faulting allowed. That's some supremely obnoxious limitations, which is why we need to sprinkle the right annotations to all relevant paths.
The one big locking context we're leaving out here is mmu notifiers, added in
commit 23b68395c7c78a764e8963fc15a7cfd318bf187f
Author: Daniel Vetter daniel.vetter@ffwll.ch
Date:   Mon Aug 26 22:14:21 2019 +0200
mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end
that one covers a lot of other callsites, and it's also allowed to wait on dma-fences from mmu notifiers. But there's no ready-made functions exposed to prime this, so I've left it out for now.
v2: Also track against mmu notifier context.
v3: kerneldoc to spec the cross-driver contract. Note that currently i915 throws in a hard-coded 10s timeout on foreign fences (not sure why that was done, but it's there), which is why that rule is worded with SHOULD instead of MUST.
Also some of the mmu_notifier/shrinker rules might surprise SoC drivers, I haven't fully audited them all. Which is infeasible anyway, we'll need to run them with lockdep and dma-fence annotations and see what goes boom.
v4: A spelling fix from Mika
Cc: Mika Kuoppala mika.kuoppala@intel.com Cc: Thomas Hellstrom thomas.hellstrom@intel.com Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Cc: linux-rdma@vger.kernel.org Cc: amd-gfx@lists.freedesktop.org Cc: intel-gfx@lists.freedesktop.org Cc: Chris Wilson chris@chris-wilson.co.uk Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com
Documentation/driver-api/dma-buf.rst | 6 ++++ drivers/dma-buf/dma-fence.c | 41 ++++++++++++++++++++++++++++ drivers/dma-buf/dma-resv.c | 4 +++ include/linux/dma-fence.h | 1 + 4 files changed, 52 insertions(+)
I still have my doubts about allowing fence waiting from within shrinkers. IMO ideally they should use a trywait approach, in order to allow memory allocation during command submission for drivers that publish fences before command submission. (Since early reservation object release requires that).
Yeah it is a bit annoying, e.g. for drm/scheduler I think we'll end up with a mempool to make sure it can handle its allocations.
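For reference, a rough sketch of that mempool idea - structure and function names entirely made up, just to show the shape of it:

#include <linux/dma-fence.h>
#include <linux/mempool.h>
#include <linux/slab.h>

/* illustrative only: per-job object needed on the completion path */
struct my_job_cleanup {
	struct dma_fence *fence;
};

static struct kmem_cache *my_cleanup_cache;
static mempool_t *my_cleanup_pool;

static int my_sched_mempool_init(void)
{
	my_cleanup_cache = KMEM_CACHE(my_job_cleanup, 0);
	if (!my_cleanup_cache)
		return -ENOMEM;

	/* keep a reserve so the signalling path never hits direct reclaim */
	my_cleanup_pool = mempool_create_slab_pool(16, my_cleanup_cache);
	if (!my_cleanup_pool) {
		kmem_cache_destroy(my_cleanup_cache);
		return -ENOMEM;
	}

	return 0;
}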
But since drivers are already waiting from within shrinkers and I take your word for HMM requiring this,
Yeah the big trouble is HMM and mmu notifiers. That's the really awkward one, the shrinker one is a lot less established.
I do wonder whether the mmu notifier constraint should only be set when mmu notifiers are enabled, since on a bunch of arm-soc gpu drivers that stuff just doesn't matter. But I expect that sooner or later these arm gpus will show up in bigger arm cores, where you might want to have kvm and maybe device virtualization and stuff, and then you need mmu notifiers.
Plus having a very clear and consistent cross-driver api contract is imo better than leaving this up to drivers and then having incompatible assumptions.
I've pinged a bunch of armsoc gpu driver people and asked them how much this hurts, so that we have a clear answer. On x86 I don't think we have much of a choice on this, with userptr in amd and i915 and hmm work in nouveau (but nouveau I think doesn't use dma_fence in there). I think it'll take us a while to really bottom out on this specific question here. -Daniel
Reviewed-by: Thomas Hellström thomas.hellstrom@intel.com
On Thu, Jun 11, 2020 at 10:34:30AM +0200, Daniel Vetter wrote:
I still have my doubts about allowing fence waiting from within shrinkers. IMO ideally they should use a trywait approach, in order to allow memory allocation during command submission for drivers that publish fences before command submission. (Since early reservation object release requires that).
Yeah it is a bit annoying, e.g. for drm/scheduler I think we'll end up with a mempool to make sure it can handle its allocations.
But since drivers are already waiting from within shrinkers and I take your word for HMM requiring this,
Yeah the big trouble is HMM and mmu notifiers. That's the really awkward one, the shrinker one is a lot less established.
I really question if HW that needs something like DMA fence should even be using mmu notifiers - the best use is HW that can fence the DMA directly without having to get involved with some command stream processing.
Or at the very least it should not be a generic DMA fence but a narrowed completion tied only into the same GPU driver's command completion processing which should be able to progress without blocking.
The intent of notifiers was never to endlessly block while vast amounts of SW does work.
Going around and switching everything in a GPU to GFP_ATOMIC seems like bad idea.
I've pinged a bunch of armsoc gpu driver people and asked them how much this hurts, so that we have a clear answer. On x86 I don't think we have much of a choice on this, with userptr in amd and i915 and hmm work in nouveau (but nouveau I think doesn't use dma_fence in there).
Right, nor will RDMA ODP.
Jason
Am 2020-06-11 um 10:15 a.m. schrieb Jason Gunthorpe:
On Thu, Jun 11, 2020 at 10:34:30AM +0200, Daniel Vetter wrote:
I still have my doubts about allowing fence waiting from within shrinkers. IMO ideally they should use a trywait approach, in order to allow memory allocation during command submission for drivers that publish fences before command submission. (Since early reservation object release requires that).
Yeah it is a bit annoying, e.g. for drm/scheduler I think we'll end up with a mempool to make sure it can handle its allocations.
But since drivers are already waiting from within shrinkers and I take your word for HMM requiring this,
Yeah the big trouble is HMM and mmu notifiers. That's the really awkward one, the shrinker one is a lot less established.
I really question if HW that needs something like DMA fence should even be using mmu notifiers - the best use is HW that can fence the DMA directly without having to get involved with some command stream processing.
Or at the very least it should not be a generic DMA fence but a narrowed completion tied only into the same GPU driver's command completion processing which should be able to progress without blocking.
The intent of notifiers was never to endlessly block while vast amounts of SW does work.
Going around and switching everything in a GPU to GFP_ATOMIC seems like bad idea.
I've pinged a bunch of armsoc gpu driver people and asked them how much this hurts, so that we have a clear answer. On x86 I don't think we have much of a choice on this, with userptr in amd and i915 and hmm work in nouveau (but nouveau I think doesn't use dma_fence in there).
Soon nouveau will get company. We're working on a recoverable page fault implementation for HMM in amdgpu where we'll need to update page tables using the GPUs SDMA engine and wait for corresponding fences in MMU notifiers.
Regards, Felix
Right, nor will RDMA ODP.
Jason _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
On Fri, Jun 12, 2020 at 1:35 AM Felix Kuehling felix.kuehling@amd.com wrote:
Am 2020-06-11 um 10:15 a.m. schrieb Jason Gunthorpe:
On Thu, Jun 11, 2020 at 10:34:30AM +0200, Daniel Vetter wrote:
I still have my doubts about allowing fence waiting from within shrinkers. IMO ideally they should use a trywait approach, in order to allow memory allocation during command submission for drivers that publish fences before command submission. (Since early reservation object release requires that).
Yeah it is a bit annoying, e.g. for drm/scheduler I think we'll end up with a mempool to make sure it can handle its allocations.
But since drivers are already waiting from within shrinkers and I take your word for HMM requiring this,
Yeah the big trouble is HMM and mmu notifiers. That's the really awkward one, the shrinker one is a lot less established.
I really question if HW that needs something like DMA fence should even be using mmu notifiers - the best use is HW that can fence the DMA directly without having to get involved with some command stream processing.
Or at the very least it should not be a generic DMA fence but a narrowed completion tied only into the same GPU driver's command completion processing which should be able to progress without blocking.
The intent of notifiers was never to endlessly block while vast amounts of SW does work.
Going around and switching everything in a GPU to GFP_ATOMIC seems like bad idea.
I've pinged a bunch of armsoc gpu driver people and asked them how much this hurts, so that we have a clear answer. On x86 I don't think we have much of a choice on this, with userptr in amd and i915 and hmm work in nouveau (but nouveau I think doesn't use dma_fence in there).
Soon nouveau will get company. We're working on a recoverable page fault implementation for HMM in amdgpu where we'll need to update page tables using the GPUs SDMA engine and wait for corresponding fences in MMU notifiers.
Well amdgpu already has dma_fence waits in the hmm callbacks, so nothing new. But since you start using these in amdkfd ... perfect opportunity to annotate the amdkfd paths for fence signalling critical sections? Especially the preempt-ctx fence should be an interesting case to annotate and see whether lockdep finds anything. Not sure what else there is. -Daniel
Regards, Felix
Right, nor will RDMA ODP.
Jason _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
On Thu, Jun 11, 2020 at 07:35:35PM -0400, Felix Kuehling wrote:
Am 2020-06-11 um 10:15 a.m. schrieb Jason Gunthorpe:
On Thu, Jun 11, 2020 at 10:34:30AM +0200, Daniel Vetter wrote:
I still have my doubts about allowing fence waiting from within shrinkers. IMO ideally they should use a trywait approach, in order to allow memory allocation during command submission for drivers that publish fences before command submission. (Since early reservation object release requires that).
Yeah it is a bit annoying, e.g. for drm/scheduler I think we'll end up with a mempool to make sure it can handle its allocations.
But since drivers are already waiting from within shrinkers and I take your word for HMM requiring this,
Yeah the big trouble is HMM and mmu notifiers. That's the really awkward one, the shrinker one is a lot less established.
I really question if HW that needs something like DMA fence should even be using mmu notifiers - the best use is HW that can fence the DMA directly without having to get involved with some command stream processing.
Or at the very least it should not be a generic DMA fence but a narrowed completion tied only into the same GPU driver's command completion processing which should be able to progress without blocking.
The intent of notifiers was never to endlessly block while vast amounts of SW does work.
Going around and switching everything in a GPU to GFP_ATOMIC seems like bad idea.
I've pinged a bunch of armsoc gpu driver people and asked them how much this hurts, so that we have a clear answer. On x86 I don't think we have much of a choice on this, with userptr in amd and i915 and hmm work in nouveau (but nouveau I think doesn't use dma_fence in there).
Soon nouveau will get company. We're working on a recoverable page fault implementation for HMM in amdgpu where we'll need to update page tables using the GPUs SDMA engine and wait for corresponding fences in MMU notifiers.
Note that HMM mandates, and I have stressed this several times in the past, that all GPU page table updates are asynchronous and do not have to wait on _anything_.
I understand that you use a DMA engine for GPU page table updates, but if you want to do so with HMM then you need a DMA context dedicated to GPU page table updates only, which all GPU page table updates go through and on which user space can not queue up jobs.
It can be for HMM only but if you want to mix HMM with non HMM then everything need to be on that queue and other command queue will have to depends on it.
Cheers, Jérôme
On Fri, Jun 12, 2020 at 1:35 AM Felix Kuehling felix.kuehling@amd.com wrote:
Am 2020-06-11 um 10:15 a.m. schrieb Jason Gunthorpe:
On Thu, Jun 11, 2020 at 10:34:30AM +0200, Daniel Vetter wrote:
I still have my doubts about allowing fence waiting from within shrinkers. IMO ideally they should use a trywait approach, in order to allow memory allocation during command submission for drivers that publish fences before command submission. (Since early reservation object release requires that).
Yeah it is a bit annoying, e.g. for drm/scheduler I think we'll end up with a mempool to make sure it can handle its allocations.
But since drivers are already waiting from within shrinkers and I take your word for HMM requiring this,
Yeah the big trouble is HMM and mmu notifiers. That's the really awkward one, the shrinker one is a lot less established.
I really question if HW that needs something like DMA fence should even be using mmu notifiers - the best use is HW that can fence the DMA directly without having to get involved with some command stream processing.
Or at the very least it should not be a generic DMA fence but a narrowed completion tied only into the same GPU driver's command completion processing which should be able to progress without blocking.
The intent of notifiers was never to endlessly block while vast amounts of SW does work.
Going around and switching everything in a GPU to GFP_ATOMIC seems like a bad idea.
I've pinged a bunch of armsoc gpu driver people and asked them how much this hurts, so that we have a clear answer. On x86 I don't think we have much of a choice on this, with userptr in amd and i915 and hmm work in nouveau (but nouveau I think doesn't use dma_fence in there).
Soon nouveau will get company. We're working on a recoverable page fault implementation for HMM in amdgpu where we'll need to update page tables using the GPUs SDMA engine and wait for corresponding fences in MMU notifiers.
Can you pls cc these patches to dri-devel when they show up? Depending upon how your hw works there's an endless amount of bad things that can happen.
Also I think (again depending upon how the hw exactly works) this stuff would be a perfect example for the dma_fence annotations.
The worst case is if your hw cannot preempt while a hw page fault is pending. That means none of the dma_fence will ever signal (the amdkfd preempt ctx fences won't, and the classic fences from amdgpu might also stall). At least when you're unlucky and the fence you're waiting on somehow (anywhere in its dependency chain really) needs the engine that's currently blocked waiting for the hw page fault.
That in turn means anything you do in your hw page fault handler is in the critical section for dma fence signalling, which has far reaching implications. -Daniel
Regards, Felix
Right, nor will RDMA ODP.
Jason
Am 2020-06-23 um 3:39 a.m. schrieb Daniel Vetter:
On Fri, Jun 12, 2020 at 1:35 AM Felix Kuehling felix.kuehling@amd.com wrote:
Am 2020-06-11 um 10:15 a.m. schrieb Jason Gunthorpe:
On Thu, Jun 11, 2020 at 10:34:30AM +0200, Daniel Vetter wrote:
I still have my doubts about allowing fence waiting from within shrinkers. IMO ideally they should use a trywait approach, in order to allow memory allocation during command submission for drivers that publish fences before command submission. (Since early reservation object release requires that).
Yeah it is a bit annoying, e.g. for drm/scheduler I think we'll end up with a mempool to make sure it can handle its allocations.
But since drivers are already waiting from within shrinkers and I take your word for HMM requiring this,
Yeah the big trouble is HMM and mmu notifiers. That's the really awkward one, the shrinker one is a lot less established.
I really question if HW that needs something like DMA fence should even be using mmu notifiers - the best use is HW that can fence the DMA directly without having to get involved with some command stream processing.
Or at the very least it should not be a generic DMA fence but a narrowed completion tied only into the same GPU driver's command completion processing which should be able to progress without blocking.
The intent of notifiers was never to endlessly block while vast amounts of SW does work.
Going around and switching everything in a GPU to GFP_ATOMIC seems like a bad idea.
I've pinged a bunch of armsoc gpu driver people and asked them how much this hurts, so that we have a clear answer. On x86 I don't think we have much of a choice on this, with userptr in amd and i915 and hmm work in nouveau (but nouveau I think doesn't use dma_fence in there).
Soon nouveau will get company. We're working on a recoverable page fault implementation for HMM in amdgpu where we'll need to update page tables using the GPUs SDMA engine and wait for corresponding fences in MMU notifiers.
Can you pls cc these patches to dri-devel when they show up? Depending upon how your hw works there's an endless amount of bad things that can happen.
Yes, I'll do that.
Also I think (again depending upon how the hw exactly works) this stuff would be a perfect example for the dma_fence annotations.
We have already applied your patch series to our development branch. I haven't looked into what annotations we'd have to add to our new code yet.
The worst case is if your hw cannot preempt while a hw page fault is pending. That means none of the dma_fence will ever signal (the amdkfd preempt ctx fences won't, and the classic fences from amdgpu might also stall). At least when you're unlucky and the fence you're waiting on somehow (anywhere in its dependency chain really) needs the engine that's currently blocked waiting for the hw page fault.
Our HW can preempt while handling a page fault, at least on the GPU generation we're working on now. On other GPUs we haven't included in our initial effort, we will not be able to preempt while a page fault is in progress. This is problematic, but that's for reasons related to our GPU hardware scheduler and unrelated to fences.
That in turn means anything you do in your hw page fault handler is in the critical section for dma fence signalling, which has far reaching implications.
I'm not sure I agree, at least for KFD. The only place where KFD uses fences that depend on preemptions is eviction fences. And we can get rid of those if we can preempt GPU access to specific BOs by invalidating GPU PTEs. That way we don't need to preempt the GPU queues while a page fault is in progress. Instead we would create more page faults.
That assumes that we can invalidate GPU PTEs without depending on fences. We've discussed possible deadlocks due to memory allocations needed on those code paths for IBs or page tables. We've already eliminated page table allocations and reservation locks on the PTE invalidation code path. And we're using a separate scheduler entity so we can't get stuck behind other IBs that depend on fences. IIRC, Christian also implemented a separate memory pool for IBs for this code path.
Regards, Felix
-Daniel
Regards, Felix
Right, nor will RDMA ODP.
Jason
On Tue, Jun 23, 2020 at 02:44:24PM -0400, Felix Kuehling wrote:
Am 2020-06-23 um 3:39 a.m. schrieb Daniel Vetter:
On Fri, Jun 12, 2020 at 1:35 AM Felix Kuehling felix.kuehling@amd.com wrote:
Am 2020-06-11 um 10:15 a.m. schrieb Jason Gunthorpe:
On Thu, Jun 11, 2020 at 10:34:30AM +0200, Daniel Vetter wrote:
I still have my doubts about allowing fence waiting from within shrinkers. IMO ideally they should use a trywait approach, in order to allow memory allocation during command submission for drivers that publish fences before command submission. (Since early reservation object release requires that).
Yeah it is a bit annoying, e.g. for drm/scheduler I think we'll end up with a mempool to make sure it can handle its allocations.
But since drivers are already waiting from within shrinkers and I take your word for HMM requiring this,
Yeah the big trouble is HMM and mmu notifiers. That's the really awkward one, the shrinker one is a lot less established.
I really question if HW that needs something like DMA fence should even be using mmu notifiers - the best use is HW that can fence the DMA directly without having to get involved with some command stream processing.
Or at the very least it should not be a generic DMA fence but a narrowed completion tied only into the same GPU driver's command completion processing which should be able to progress without blocking.
The intent of notifiers was never to endlessly block while vast amounts of SW does work.
Going around and switching everything in a GPU to GFP_ATOMIC seems like a bad idea.
I've pinged a bunch of armsoc gpu driver people and asked them how much this hurts, so that we have a clear answer. On x86 I don't think we have much of a choice on this, with userptr in amd and i915 and hmm work in nouveau (but nouveau I think doesn't use dma_fence in there).
Soon nouveau will get company. We're working on a recoverable page fault implementation for HMM in amdgpu where we'll need to update page tables using the GPUs SDMA engine and wait for corresponding fences in MMU notifiers.
Can you pls cc these patches to dri-devel when they show up? Depending upon how your hw works there's an endless amount of bad things that can happen.
Yes, I'll do that.
Also I think (again depending upon how the hw exactly works) this stuff would be a perfect example for the dma_fence annotations.
We have already applied your patch series to our development branch. I haven't looked into what annotations we'd have to add to our new code yet.
The worst case is if your hw cannot preempt while a hw page fault is pending. That means none of the dma_fence will ever signal (the amdkfd preempt ctx fences won't, and the classic fences from amdgpu might also stall). At least when you're unlucky and the fence you're waiting on somehow (anywhere in its dependency chain really) needs the engine that's currently blocked waiting for the hw page fault.
Our HW can preempt while handling a page fault, at least on the GPU generation we're working on now. On other GPUs we haven't included in our initial effort, we will not be able to preempt while a page fault is in progress. This is problematic, but that's for reasons related to our GPU hardware scheduler and unrelated to fences.
Well the trouble is if the page fault holds up a preempt, then there's no way for a dma_fence to complete while your hw page fault handler is stuck doing whatever. That means the entire hw page fault becomes a fence signalling critical section, with the consequence that there's almost nothing you can actually do. System memory becomes GFP_ATOMIC only, and for vram you need to make sure that you never evict anything that might be in active use.
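With the annotations from this series the fault path would end up wrapped roughly like the sketch below, and lockdep then screams about anything inside that could stall fence completion (the my_gpu_* names are made up, only dma_fence_begin/end_signalling() are from the series):

#include <linux/dma-fence.h>
#include <linux/gfp.h>
#include <linux/types.h>

struct my_gpu;

/* made-up helper, purely for illustration */
int my_gpu_map_page(struct my_gpu *gpu, u64 addr, gfp_t gfp);

static int my_gpu_handle_fault(struct my_gpu *gpu, u64 addr)
{
	bool cookie = dma_fence_begin_signalling();
	int ret;

	/*
	 * Everything in here is a fence signalling critical section:
	 * no GFP_KERNEL, no dma_resv_lock, no waiting on other fences.
	 */
	ret = my_gpu_map_page(gpu, addr, GFP_ATOMIC);

	dma_fence_end_signalling(cookie);
	return ret;
}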
So not enabling these platforms sounds like a very good plan to me :-)
That in turn means anything you do in your hw page fault handler is in the critical section for dma fence signalling, which has far reaching implications.
I'm not sure I agree, at least for KFD. The only place where KFD uses fences that depend on preemptions is eviction fences. And we can get rid of those if we can preempt GPU access to specific BOs by invalidating GPU PTEs. That way we don't need to preempt the GPU queues while a page fault is in progress. Instead we would create more page faults.
The big problem isn't pure kfd workloads, all the trouble comes in when you mix kfd and amdgpu workloads. kfd alone is easy, just make sure there's no fences to begin with, and there will be no problems.
That assumes that we can invalidate GPU PTEs without depending on fences. We've discussed possible deadlocks due to memory allocations needed on those code paths for IBs or page tables. We've already eliminated page table allocations and reservation locks on the PTE invalidation code path. And we're using a separate scheduler entity so we can't get stuck behind other IBs that depend on fences. IIRC, Christian also implemented a separate memory pool for IBs for this code path.
Yeah it's the memory allocations that kill you. Both system memory, but also vram. Since evicting vram might mean you end up stuck behind a dma_fence of a legacy context hogging that memory, and probably also means taking a few dma_resv_locks. All of these things deadlock if you can't preempt the context with something else. -Daniel
Regards, Felix
-Daniel
Regards, Felix
Right, nor will RDMA ODP.
Jason
Hi Jason,
Somehow this got stuck somewhere in the mail queues, only popped up just now ...
On Thu, Jun 11, 2020 at 11:15:15AM -0300, Jason Gunthorpe wrote:
On Thu, Jun 11, 2020 at 10:34:30AM +0200, Daniel Vetter wrote:
I still have my doubts about allowing fence waiting from within shrinkers. IMO ideally they should use a trywait approach, in order to allow memory allocation during command submission for drivers that publish fences before command submission. (Since early reservation object release requires that).
Yeah it is a bit annoying, e.g. for drm/scheduler I think we'll end up with a mempool to make sure it can handle its allocations.
But since drivers are already waiting from within shrinkers and I take your word for HMM requiring this,
Yeah the big trouble is HMM and mmu notifiers. That's the really awkward one, the shrinker one is a lot less established.
I really question if HW that needs something like DMA fence should even be using mmu notifiers - the best use is HW that can fence the DMA directly without having to get involved with some command stream processing.
Or at the very least it should not be a generic DMA fence but a narrowed completion tied only into the same GPU driver's command completion processing which should be able to progress without blocking.
The problem with gpus is that these completions leak across the board like mad. Both internally within memory managers (made a lot worse with p2p direct access to vram), and through uapi.
Many gpus still have a very hard time preempting, so doing an overall switch in drivers/gpu to a memory management model where that is required is not a very realistic option. And minimally you need either preempt (still takes a while, but a lot faster generally than waiting for work to complete) or hw faults (just a bunch of tlb flushes plus virtual indexed caches, so just the caveat of that for a gpu, which has lots of big tlbs and caches). So preventing the completion leaks within the kernel is I think unrealistic, except if we just say "well sorry, run on windows, mkay" for many gpu workloads. Or more realistically "well sorry, run on the nvidia blob with nvidia hw".
The userspace side we can somewhat isolate, at least for pure compute workloads. But the thing is drivers/gpu is a continuum from tiny socs (where dma_fence is a very nice model) to huge compute stuff (where it's maybe not the nicest, but hey hw sucks so still needed). Doing a full-on break in uapi somewhere in there is at least a bit awkward, e.g. some of the media codec code on intel runs all the way from the smallest intel soc to the big transcode servers.
So the current status quo is "total mess, every driver defines their own rules". All I'm trying to do is set some common rules here, to make this mess slightly more manageable and overall reviewable and testable.
I have no illusions that this is fundamentally pretty horrible, and the leftover wiggle room for writing a memory manager is barely more than a hairline. Just not seeing how other options are better.
The intent of notifiers was never to endlessly block while vast amounts of SW does work.
Going around and switching everything in a GPU to GFP_ATOMIC seems like a bad idea.
It's not everyone, or at least not everywhere, it's some fairly limited cases. Also, even if we drop the mmu_notifier on the floor, then we're stuck with shrinkers and GFP_NOFS. Still need a mempool of some sorts to guarantee you get out of a bind, so not much better.
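Roughly what I mean with a mempool here, as a sketch (the my_* names and the pool size are made up, the mempool calls are the stock kernel API):

#include <linux/mempool.h>
#include <linux/list.h>
#include <linux/slab.h>

/* whatever the completion path needs to allocate, made up for illustration */
struct my_job_cb {
	struct list_head node;
};

static mempool_t *job_cb_pool;

static int my_sched_init(void)
{
	/* create the pool at init time, outside any fence signalling critical section */
	job_cb_pool = mempool_create_kmalloc_pool(16, sizeof(struct my_job_cb));
	return job_cb_pool ? 0 : -ENOMEM;
}

static struct my_job_cb *my_sched_get_cb(void)
{
	/*
	 * GFP_NOWAIT never recurses into reclaim; if the allocation fails
	 * mempool falls back to the preallocated elements (handed back later
	 * with mempool_free()), so with a big enough pool the signalling
	 * path keeps making forward progress.
	 */
	return mempool_alloc(job_cb_pool, GFP_NOWAIT);
}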
At least that's my current understanding of where we are across all drivers.
I've pinged a bunch of armsoc gpu driver people and asked them how much this hurts, so that we have a clear answer. On x86 I don't think we have much of a choice on this, with userptr in amd and i915 and hmm work in nouveau (but nouveau I think doesn't use dma_fence in there).
Right, nor will RDMA ODP.
Hm, what's the context here? I thought RDMA side you really don't want dma_fence in mmu_notifiers, so not clear to me what you're agreeing on here. -Daniel
On Tue, Jun 16, 2020 at 02:07:19PM +0200, Daniel Vetter wrote:
I've pinged a bunch of armsoc gpu driver people and asked them how much this hurts, so that we have a clear answer. On x86 I don't think we have much of a choice on this, with userptr in amd and i915 and hmm work in nouveau (but nouveau I think doesn't use dma_fence in there).
Right, nor will RDMA ODP.
Hm, what's the context here? I thought RDMA side you really don't want dma_fence in mmu_notifiers, so not clear to me what you're agreeing on here.
rdma does not use dma_fence at all, and though it is hard to tell, I didn't notice a dma_fence in the nouveau invalidation call path.
At the very least I think there should be some big warning that dma_fence in notifiers should be avoided.
Ie it is strange that the new totally-not-a-gpu drivers use dma_fence, they surely don't have the same constraints as the existing GPU world, and it would be annoying to see dma_fence notifiers spring up in them
Jason
On Wed, Jun 17, 2020 at 9:27 AM Jason Gunthorpe jgg@ziepe.ca wrote:
On Tue, Jun 16, 2020 at 02:07:19PM +0200, Daniel Vetter wrote:
I've pinged a bunch of armsoc gpu driver people and asked them how much this hurts, so that we have a clear answer. On x86 I don't think we have much of a choice on this, with userptr in amd and i915 and hmm work in nouveau (but nouveau I think doesn't use dma_fence in there).
Right, nor will RDMA ODP.
Hm, what's the context here? I thought RDMA side you really don't want dma_fence in mmu_notifiers, so not clear to me what you're agreeing on here.
rdma does not use dma_fence at all, and though it is hard to tell, I didn't notice a dma_fence in the nouveau invalidation call path.
Nouveau for compute has hw page faults. It doesn't have hw page faults for non-compute fixed function blocks afaik, so there's a hybrid model going on. But nouveau also doesn't support userspace memory (instead of driver-allocated buffer objects) for these fixed function blocks, so no need to have a dma_fence_wait in there.
At the very least I think there should be some big warning that dma_fence in notifiers should be avoided.
Yeah I'm working on documentation, and also the notifiers here hopefully make it clear it's a massive pain. I think we could even make a hard rule that dma_fence in mmu notifiers outside of drivers/gpu is a bug/misfeature.
Might be a good idea to add a MAINTAINERS entry with a K: regex pattern, so that you can catch such usage. We do already have such a pattern for dma-fence, to catch abuse. So if you want I could type up a documentation patch for this, get your and others' acks, and the dri-devel folks would enforce that the dma_fence_wait madness doesn't leak beyond drivers/gpu
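Something along these lines, extending what we already have for dma-buf (sketch only, the exact regex would need double-checking against the current MAINTAINERS entry):

DMA BUFFER SHARING FRAMEWORK
L:	dri-devel@lists.freedesktop.org
K:	\bdma_(?:buf|fence|resv)\b

With a K: pattern like that get_maintainer.pl cc's dri-devel on any patch that touches dma_fence anywhere in the tree, so a new dma_fence_wait outside of drivers/gpu at least gets seen before it lands.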
Ie it is strange that the new totally-not-a-gpu drivers use dma_fence, they surely don't have the same constraints as the existing GPU world, and it would be annoying to see dma_fence notifiers spring up in them
If you mean drivers/misc/habanalabs, that's going to get taken care of:
commit ed65bfd9fd86dec3772570b0320ca85b9fb69f2e
Author: Daniel Vetter daniel.vetter@ffwll.ch
Date: Mon May 11 11:11:42 2020 +0200
habanalabs: don't set default fence_ops->wait
It's the default.
Also so much for "we're not going to tell the graphics people how to review their code", dma_fence is a pretty core piece of gpu driver infrastructure. And it's very much uapi relevant, including piles of corresponding userspace protocols and libraries for how to pass these around.
Would be great if habanalabs would not use this (from a quick look it's not needed at all), since open sourcing the userspace and playing by the usual rules isn't on the table. If that's not possible (because it's actually using the uapi part of dma_fence to interact with gpu drivers) then we have exactly what everyone promised we'd want to avoid.
Signed-off-by: Daniel Vetter daniel.vetter@intel.com
Reviewed-by: Oded Gabbay oded.gabbay@gmail.com
Signed-off-by: Oded Gabbay oded.gabbay@gmail.com
Oded has agreed to remove the dma-fence usage, since they really don't need it (and all the baggage that comes with it), plain old completion is enough for their use. This use is also why I added the regex to MAINTAINERS, so that in the future we can catch people who try to use dma_fence because it looks cute and useful, and are completely oblivious to all the pain and headaches involved.
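For that kind of driver-internal job tracking the whole thing boils down to something like this sketch (the my_job names are made up, the completion API is the stock one):

#include <linux/completion.h>
#include <linux/jiffies.h>
#include <linux/errno.h>

struct my_job {
	struct completion done;
};

static void my_job_submit(struct my_job *job)
{
	init_completion(&job->done);
	/* ... hand the job to the hw queue ... */
}

/* called from the irq handler once the hw has finished the job */
static void my_job_irq_done(struct my_job *job)
{
	complete(&job->done);
}

/* ioctl side waiting for the job, with no cross-driver contract attached */
static int my_job_wait(struct my_job *job)
{
	if (!wait_for_completion_timeout(&job->done, msecs_to_jiffies(5000)))
		return -ETIMEDOUT;
	return 0;
}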
Cheers, Daniel
On Wed, Jun 17, 2020 at 09:57:54AM +0200, Daniel Vetter wrote:
At the very least I think there should be some big warning that dma_fence in notifiers should be avoided.
Yeah I'm working on documentation, and also the notifiers here hopefully make it clear it's a massive pain. I think we could even make a hard rule that dma_fence in mmu notifiers outside of drivers/gpu is a bug/misfeature.
Yep!
Might be a good idea to add a MAINTAINERS entry with a K: regex pattern, so that you can catch such usage. We do already have such a pattern for dma-fence, to catch abuse. So if you want I could type up a documentation patch for this, get your and others' acks, and the dri-devel folks would enforce that the dma_fence_wait madness doesn't leak beyond drivers/gpu
It seems like the best thing
Oded has agreed to remove the dma-fence usage, since they really don't need it (and all the baggage that comes with it), plain old completion is enough for their use. This use is also why I added the regex to MAINTAINERS, so that in the future we can catch people who try to use dma_fence because it looks cute and useful, and are completely oblivious to all the pain and headaches involved.
This is good!
Thanks, Jason
On Wed, Jun 17, 2020 at 12:29:40PM -0300, Jason Gunthorpe wrote:
On Wed, Jun 17, 2020 at 09:57:54AM +0200, Daniel Vetter wrote:
At the very least I think there should be some big warning that dma_fence in notifiers should be avoided.
Yeah I'm working on documentation, and also the notifiers here hopefully make it clear it's a massive pain. I think we could even make a hard rule that dma_fence in mmu notifiers outside of drivers/gpu is a bug/misfeature.
Yep!
Might be a good idea to add a MAINTAINERS entry with a K: regex pattern, so that you can catch such usage. We do already have such a pattern for dma-fence, to catch abuse. So if you want I could type up a documentation patch for this, get your and others' acks, and the dri-devel folks would enforce that the dma_fence_wait madness doesn't leak beyond drivers/gpu
It seems like the best thing
Just thought about where to best put this, and I think including it as another paragraph in the next round of this series makes the most sense. You'll get cc'ed for acking when that happens - might take a while since there's a lot of details here all over to sort out. -Daniel
Oded has agreed to remove the dma-fence usage, since they really don't need it (and all the baggage that comes with it), plain old completion is enough for their use. This use is also why I added the regex to MAINTAINERS, so that in the future we can catch people who try to use dma_fence because it looks cute and useful, and are completely oblivious to all the pain and headaches involved.
This is good!
Thanks, Jason
On Tue, Jun 16, 2020 at 2:07 PM Daniel Vetter daniel@ffwll.ch wrote:
Hi Jason,
Somehow this got stuck somewhere in the mail queues, only popped up just now ...
On Thu, Jun 11, 2020 at 11:15:15AM -0300, Jason Gunthorpe wrote:
On Thu, Jun 11, 2020 at 10:34:30AM +0200, Daniel Vetter wrote:
I still have my doubts about allowing fence waiting from within shrinkers. IMO ideally they should use a trywait approach, in order to allow memory allocation during command submission for drivers that publish fences before command submission. (Since early reservation object release requires that).
Yeah it is a bit annoying, e.g. for drm/scheduler I think we'll end up with a mempool to make sure it can handle its allocations.
But since drivers are already waiting from within shrinkers and I take your word for HMM requiring this,
Yeah the big trouble is HMM and mmu notifiers. That's the really awkward one, the shrinker one is a lot less established.
I really question if HW that needs something like DMA fence should even be using mmu notifiers - the best use is HW that can fence the DMA directly without having to get involved with some command stream processing.
Or at the very least it should not be a generic DMA fence but a narrowed completion tied only into the same GPU driver's command completion processing which should be able to progress without blocking.
The problem with gpus is that these completions leak across the board like mad. Both internally within memory managers (made a lot worse with p2p direct access to vram), and through uapi.
Many gpus still have a very hard time preempting, so doing an overall switch in drivers/gpu to a memory management model where that is required is not a very realistic option. And minimally you need either preempt (still takes a while, but a lot faster generally than waiting for work to complete) or hw faults (just a bunch of tlb flushes plus virtual indexed caches, so just the caveat of that for a gpu, which has lots of big tlbs and caches). So preventing the completion leaks within the kernel is I think unrealistic, except if we just say "well sorry, run on windows, mkay" for many gpu workloads. Or more realistically "well sorry, run on the nvidia blob with nvidia hw".
The userspace side we can somewhat isolate, at least for pure compute workloads. But the thing is drivers/gpu is a continuum from tiny socs (where dma_fence is a very nice model) to huge compute stuff (where it's maybe not the nicest, but hey hw sucks so still needed). Doing a full-on break in uapi somewhere in there is at least a bit awkward, e.g. some of the media codec code on intel runs all the way from the smallest intel soc to the big transcode servers.
So the current status quo is "total mess, every driver defines their own rules". All I'm trying to do is set some common rules here, to make this mess slightly more manageable and overall reviewable and testable.
I have no illusions that this is fundamentally pretty horrible, and the leftover wiggle room for writing a memory manager is barely more than a hairline. Just not seeing how other options are better.
So the bad news is that gpus are horrible, but I think if you don't have to review gpu drivers it's substantially better. If you do have hw with full device page fault support, then there's no need to ever install a dma_fence. Punching out device ptes and flushing caches is all that's needed. That is also the plan we have, for the workloads and devices where that's possible.
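For such hw the notifier really is just a device tlb shootdown, roughly like this sketch (all the my_* names are made up):

#include <linux/mmu_notifier.h>

struct my_userptr {
	struct mmu_interval_notifier notifier;
};

/* made-up helpers for the device-side pte zap and tlb flush */
void my_zap_device_ptes(struct my_userptr *umem, unsigned long start,
			unsigned long end);
void my_flush_device_tlb(struct my_userptr *umem);

static bool my_invalidate(struct mmu_interval_notifier *mni,
			  const struct mmu_notifier_range *range,
			  unsigned long cur_seq)
{
	struct my_userptr *umem = container_of(mni, struct my_userptr, notifier);

	mmu_interval_set_seq(mni, cur_seq);

	/* punch out the device ptes and flush, no dma_fence anywhere;
	 * the hw simply faults again and resolves against the new pages */
	my_zap_device_ptes(umem, range->start, range->end);
	my_flush_device_tlb(umem);

	return true;
}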
Now my understanding for rdma is that if you don't have hw page fault support, then the only other option is to more or less permanently pin the memory. So again, dma_fence is completely useless, since it's entirely up to userspace when a given piece of registered memory isn't needed anymore, and the entire problem boils down to how much we allow random userspace to just pin (system or device) memory. Or at least I don't really see any other solution.
On the other end we have simpler devices like video input/output. Those always need pinned memory, but through hw design it's limited in how much you can pin (generally max resolution times a limited set of buffers to cycle through). Just including that memory pinning allowance as part of device access makes sense.
It's only gpus (I think) which are in this awkward in-between spot where dynamic memory management really is much wanted, but the hw kinda sucks. Aside, about 10+ years ago we had a similar problem with gpu hw, but for security: Many gpus didn't have any kind of page tables to isolate different clients from one another. drivers/gpu fixed this by parsing&validating what userspace submitted to make sure it's only ever accessing its own buffers. Most gpus have become reasonable nowadays and do have proper per-process pagetables (gpu process, not the pasid stuff), but even today there's still some of the old model left in some of the smallest SoCs.
tldr; of all this: gpus kinda suck sometimes, but that's also not news :-/
Cheers, Daniel
The intent of notifiers was never to endlessly block while vast amounts of SW does work.
Going around and switching everything in a GPU to GFP_ATOMIC seems like a bad idea.
It's not everyone, or at least not everywhere, it's some fairly limited cases. Also, even if we drop the mmu_notifier on the floor, then we're stuck with shrinkers and GFP_NOFS. Still need a mempool of some sorts to guarantee you get out of a bind, so not much better.
At least that's my current understanding of where we are across all drivers.
I've pinged a bunch of armsoc gpu driver people and asked them how much this hurts, so that we have a clear answer. On x86 I don't think we have much of a choice on this, with userptr in amd and i915 and hmm work in nouveau (but nouveau I think doesn't use dma_fence in there).
Right, nor will RDMA ODP.
Hm, what's the context here? I thought RDMA side you really don't want dma_fence in mmu_notifiers, so not clear to me what you're agreeing on here.
-Daniel
Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
On Wed, Jun 17, 2020 at 08:48:50AM +0200, Daniel Vetter wrote:
Now my understanding for rdma is that if you don't have hw page fault support,
The RDMA ODP feature is restartable HW page faulting just like nouveau has. The classical MR feature doesn't have this. Only mlx5 HW supports ODP today.
It's only gpus (I think) which are in this awkward in-between spot where dynamic memory management really is much wanted, but the hw kinda sucks. Aside, about 10+ years ago we had a similar problem with gpu hw, but for security: Many gpus didn't have any kind of page tables to isolate different clients from one another. drivers/gpu fixed this by parsing&validating what userspace submitted to make sure it's only ever accessing its own buffers. Most gpus have become reasonable nowadays and do have proper per-process pagetables (gpu process, not the pasid stuff), but even today there's still some of the old model left in some of the smallest SoCs.
But I still don't understand why a dma fence is needed inside the GPU driver itself in the notifier.
Surely the GPU driver can block and release the notifier directly from its own command processing channel?
Why does this fence and all it entails need to leak out across drivers?
Jason
On Wed, Jun 17, 2020 at 12:28:35PM -0300, Jason Gunthorpe wrote:
On Wed, Jun 17, 2020 at 08:48:50AM +0200, Daniel Vetter wrote:
Now my understanding for rdma is that if you don't have hw page fault support,
The RDMA ODP feature is restartable HW page faulting just like nouveau has. The classical MR feature doesn't have this. Only mlx5 HW supports ODP today.
It's only gpus (I think) which are in this awkward in-between spot where dynamic memory management really is much wanted, but the hw kinda sucks. Aside, about 10+ years ago we had a similar problem with gpu hw, but for security: Many gpus didn't have any kind of page tables to isolate different clients from one another. drivers/gpu fixed this by parsing&validating what userspace submitted to make sure it's only ever accessing its own buffers. Most gpus have become reasonable nowadays and do have proper per-process pagetables (gpu process, not the pasid stuff), but even today there's still some of the old model left in some of the smallest SoCs.
But I still don't understand why a dma fence is needed inside the GPU driver itself in the notifier.
Surely the GPU driver can block and release the notifier directly from its own command processing channel?
Why does this fence and all it entails need to leak out across drivers?
So 10 years ago we had this world where every gpu driver is its own bucket, nothing leaks out to the world. But the world had a different idea of how gpus were supposed to work, with stuff like:
- laptops with a power-efficient but slow gpu integrated on the cpu die, and a 2nd, much faster but also more wasteful gpu separately
- also multi-gpu rendering (but on linux we never really got around to enabling that, at least not for 3d rendering)
- socs just bundle IP blocks together, and very often they feel like they have to do their own display block (it's fairly easy and allows you to keep your hw engineers justified on payroll with some more patents they create), but anything more fancy they buy in. So from a driver architecture pov even a single chip soc looks like a bundle of gpus
And you want to pipeline all this because performance, so waiting in userspace for one block to finish before you hand it over to the other isn't a good idea.
Hence dma_fence as a cross driver leak was created by pulling the gpu completion tracking from the drm/ttm library for managing vram.
Now with glorious hindsight we could have come up with a different approach, where synchronization is managed by userspace, and the kernel just provides some primitives (kinda like futexes, but for gpu). And the kernel manages residency and gpu pte wrangling entirely separately. But:
- 10 years ago drivers/gpu was a handful of people at best
- we just finished the massive rewrite to get to a kernel memory manager and kernel modesetting (over 5 years after windows/macos), so appetite for massive rewrites was minimal.
Here we are, now with 50 more drivers built on top and an entire userspace ecosystem that relies on all this (because yes we made dma_fence also the building block for all the cross-process uapi, why wouldn't we).
I hope that explains a bit the history of how and why we ended up here.
Maybe I should do a plumbers talk about "How not to memory manage - cautionary tales from drivers/gpu". I think there are a lot of areas where the conversation usually goes "wtf" ... long explanation of history and technical reasons leading to an "oh dear". With a lot of other accelerators and things landing it might be good to have a list of things that look tempting (because hey 2% faster) but aren't worth the pain. -Daniel
On Thu, Jun 18, 2020 at 05:00:51PM +0200, Daniel Vetter wrote:
On Wed, Jun 17, 2020 at 12:28:35PM -0300, Jason Gunthorpe wrote:
On Wed, Jun 17, 2020 at 08:48:50AM +0200, Daniel Vetter wrote:
Now my understanding for rdma is that if you don't have hw page fault support,
The RDMA ODP feature is restartable HW page faulting just like nouveau has. The classical MR feature doesn't have this. Only mlx5 HW supports ODP today.
It's only gpus (I think) which are in this awkward in-between spot where dynamic memory management really is much wanted, but the hw kinda sucks. Aside, about 10+ years ago we had a similar problem with gpu hw, but for security: Many gpus didn't have any kind of page tables to isolate different clients from one another. drivers/gpu fixed this by parsing&validating what userspace submitted to make sure it's only ever accessing its own buffers. Most gpus have become reasonable nowadays and do have proper per-process pagetables (gpu process, not the pasid stuff), but even today there's still some of the old model left in some of the smallest SoCs.
But I still don't understand why a dma fence is needed inside the GPU driver itself in the notifier.
Surely the GPU driver can block and release the notifier directly from its own command processing channel?
Why does this fence and all it entails need to leak out across drivers?
So 10 years ago we had this world where every gpu driver is its own bucket, nothing leaks out to the world. But the world had a different idea of how gpus were supposed to work, with stuff like:
Sure, I understand DMA fence, but why does a *notifier* need it?
The job of the notifier is to guarantee that the device it is connected to is not doing DMA before it returns.
That just means you need to prove that device is done with the buffer.
As I've understood GPU that means you need to show that the commands associated with the buffer have completed. This is all local stuff within the driver, right? Why use fence (other than it already exists)
Jason
On Fri, Jun 19, 2020 at 8:58 AM Jason Gunthorpe jgg@ziepe.ca wrote:
On Thu, Jun 18, 2020 at 05:00:51PM +0200, Daniel Vetter wrote:
On Wed, Jun 17, 2020 at 12:28:35PM -0300, Jason Gunthorpe wrote:
On Wed, Jun 17, 2020 at 08:48:50AM +0200, Daniel Vetter wrote:
Now my understanding for rdma is that if you don't have hw page fault support,
The RDMA ODP feature is restartable HW page faulting just like nouveau has. The classical MR feature doesn't have this. Only mlx5 HW supports ODP today.
It's only gpus (I think) which are in this awkward in-between spot where dynamic memory management really is much wanted, but the hw kinda sucks. Aside, about 10+ years ago we had a similar problem with gpu hw, but for security: Many gpus didn't have any kind of page tables to isolate different clients from one another. drivers/gpu fixed this by parsing&validating what userspace submitted to make sure it's only ever accessing its own buffers. Most gpus have become reasonable nowadays and do have proper per-process pagetables (gpu process, not the pasid stuff), but even today there's still some of the old model left in some of the smallest SoCs.
But I still don't understand why a dma fence is needed inside the GPU driver itself in the notifier.
Surely the GPU driver can block and release the notifier directly from its own command processing channel?
Why does this fence and all it entails need to leak out across drivers?
So 10 years ago we had this world where every gpu driver is its own bucket, nothing leaks out to the world. But the world had a different idea of how gpus were supposed to work, with stuff like:
Sure, I understand DMA fence, but why does a *notifier* need it?
The job of the notifier is to guarantee that the device it is connected to is not doing DMA before it returns.
That just means you need to prove that device is done with the buffer.
As I've understood GPU that means you need to show that the commands associated with the buffer have completed. This is all local stuff within the driver, right? Why use fence (other than it already exists)
Because that's the end-of-dma thing. And it's cross-driver for the above reasons, e.g.
- device A renders some stuff. Userspace gets dma_fence A out of that (well sync_file or one of the other uapi interfaces, but you get the idea)
- userspace (across process or just different driver) issues more rendering for device B, which depends upon the rendering done on device A. So dma_fence A is a dependency and will block this dma operation. Userspace (and the kernel) gets dma_fence B out of this
- because unfortunate reasons, the same rendering on device B also needs a userptr buffer, which means that dma_fence B is also the one that the mmu_range_notifier needs to wait on before it can tell core mm that it can go ahead and release those pages
- unhappiness ensues, because now the mmu notifier from device B can get hung up on the dma operation device A is doing
If you want to avoid this, either a) have less shitty hw (not an option, gpus are gpus, it is slowly getting better though), or b) force userspace to stall before handing over to the next device (about as uncool), or c) just pin all the memory always, who cares (also rather unpopular, gpus tend to use all the memory they can get).
I guess the thing with gpus is that dma operations aren't like read/writes for pretty much everything else, but essentially compute contexts (usually implemented as ringbuffers you stream stuff into) with cross-everything dependencies. This even holds within a single gpu, since pretty much all modern gpus have multiple different engines specialized in different things. And yup that's directly exposed to userspace, for vulkan and other low-level gpu apis even directly to applications. So a dma operation for a gpu isn't just "done when the read/write finishes", but pulls in an entire chain of dependencies and ordering that needs to happen before it can even start.
-Daniel
On Fri, Jun 19, 2020 at 09:22:09AM +0200, Daniel Vetter wrote:
As I've understood GPU that means you need to show that the commands associated with the buffer have completed. This is all local stuff within the driver, right? Why use fence (other than it already exists)
Because that's the end-of-dma thing. And it's cross-driver for the above reasons, e.g.
- device A renders some stuff. Userspace gets dma_fence A out of that
(well sync_file or one of the other uapi interfaces, but you get the idea)
- userspace (across process or just different driver) issues more
rendering for device B, which depends upon the rendering done on device A. So dma_fence A is a dependency and will block this dma operation. Userspace (and the kernel) gets dma_fence B out of this
- because unfortunate reasons, the same rendering on device B also
needs a userptr buffer, which means that dma_fence B is also the one that the mmu_range_notifier needs to wait on before it can tell core mm that it can go ahead and release those pages
I was afraid you'd say this - this is complete madness for other DMA devices to borrow the notifier hook of the first device!
What if the first device is a page faulting device and doesn't call dma_fence??
If you are going to treat things this way then the mmu notifier really needs to be part of some core DMA buf, and not randomly sprinkled in drivers
But really this is what page pinning is supposed to be used for, the MM behavior when it blocks on a pinned page is less invasive than if it stalls inside a mmu notifier.
You can mix it, use mmu notifiers to keep track of whether the buffer is still live, but when you want to trigger DMA then pin the pages and keep them pinned until DMA is done. The pin protects things (well, fork is still a problem)
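Ie roughly this pattern (a sketch, error handling trimmed, the my_* names are made up):

#include <linux/mm.h>
#include <linux/errno.h>

/* when the driver actually wants to start DMA: take a pin, don't rely on
 * the notifier to keep the pages alive while the hw is using them */
static int my_start_dma(unsigned long start, int nr_pages, struct page **pages)
{
	int npages;

	npages = pin_user_pages_fast(start, nr_pages,
				     FOLL_WRITE | FOLL_LONGTERM, pages);
	if (npages != nr_pages)
		return npages < 0 ? npages : -EFAULT;

	/* ... dma map, program the hw, kick off the DMA ... */
	return 0;
}

/* once the hw is known to be done with the buffer */
static void my_dma_done(struct page **pages, int npages)
{
	unpin_user_pages_dirty_lock(pages, npages, true);
}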
Do not need to wait on dma_fence in notifiers.
Jason
On Fri, Jun 19, 2020 at 1:39 PM Jason Gunthorpe jgg@ziepe.ca wrote:
On Fri, Jun 19, 2020 at 09:22:09AM +0200, Daniel Vetter wrote:
As I've understood GPU that means you need to show that the commands associated with the buffer have completed. This is all local stuff within the driver, right? Why use fence (other than it already exists)
Because that's the end-of-dma thing. And it's cross-driver for the above reasons, e.g.
- device A renders some stuff. Userspace gets dma_fence A out of that
(well sync_file or one of the other uapi interfaces, but you get the idea)
- userspace (across process or just different driver) issues more
rendering for device B, which depends upon the rendering done on device A. So dma_fence A is a dependency and will block this dma operation. Userspace (and the kernel) gets dma_fence B out of this
- because unfortunate reasons, the same rendering on device B also
needs a userptr buffer, which means that dma_fence B is also the one that the mmu_range_notifier needs to wait on before it can tell core mm that it can go ahead and release those pages
I was afraid you'd say this - this is complete madness for other DMA devices to borrow the notifier hook of the first device!
The first device might not even have a notifier. This is the 2nd device, waiting on a dma_fence of its own, but which happens to be queued up as a dma operation behind something else.
What if the first device is a page faulting device and doesn't call dma_fence??
Not sure what you mean with this ... even if it does page-faulting for some other reasons, it'll emit a dma_fence which the 2nd device can consume as a dependency.
If you are going to treat things this way then the mmu notifier really needs to be part of some core DMA buf, and not randomly sprinkled in drivers
So maybe again unclear, we don't allow such userptr dma-buf to even be shared. They're just for slurping stuff into the local device (generally from file io or something the cpu has done or similar). There have been attempts to use it as the general backing storage, but that didn't go down too well because of way too many complications.
Generally most memory the gpu operates on isn't stuff that's mmu_notifier'ed. And also, the device with userptr support only waits for its own dma_fence (because well you can't share this stuff, we disallow that).
The problem is that there's piles of other dependencies for a dma job. GPU doesn't just consume a single buffer each time, it consumes entire lists of buffers and mixes them all up in funny ways. Some of these buffers are userptr, entirely local to the device. Other buffers are just normal device driver allocations (and managed with some shrinker to keep them in check). And then there's the actually shared dma-buf with other devices. The trouble is that they're all bundled up together.
Now we probably should have some helper code for userptr so that all drivers do this roughly the same, but that's just not there yet. But it can't be a dma-buf exporter behind the dma-buf interfaces, because even just pinned get_user_pages would have too different semantics compared to normal shared dma-buf objects; that's all very tightly tied into the specific driver.
But really this is what page pinning is supposed to be used for, the MM behavior when it blocks on a pinned page is less invasive than if it stalls inside a mmu notifier.
You can mix it, use mmu notifiers to keep track of whether the buffer is still live, but when you want to trigger DMA then pin the pages and keep them pinned until DMA is done. The pin protects things (well, fork is still a problem)
Hm I thought amdgpu had that (or drm/radeon as the previous incarnation of that stack), and was unhappy about the issues. Would need Christian König to chime in.
Do not need to wait on dma_fence in notifiers.
Maybe :-) The goal of this series is more to document current rules and make them more consistent. Fixing them if we don't like them might be a follow-up task, but that would likely be a pile more work. First we need to know what the exact shape of the problem even is. -Daniel
On Fri, Jun 19, 2020 at 05:06:04PM +0200, Daniel Vetter wrote:
On Fri, Jun 19, 2020 at 1:39 PM Jason Gunthorpe jgg@ziepe.ca wrote:
On Fri, Jun 19, 2020 at 09:22:09AM +0200, Daniel Vetter wrote:
As I've understood GPU that means you need to show that the commands associated with the buffer have completed. This is all local stuff within the driver, right? Why use fence (other than it already exists)
Because that's the end-of-dma thing. And it's cross-driver for the above reasons, e.g.
- device A renders some stuff. Userspace gets dma_fence A out of that
(well sync_file or one of the other uapi interfaces, but you get the idea)
- userspace (across process or just different driver) issues more
rendering for device B, which depends upon the rendering done on device A. So dma_fence A is a dependency and will block this dma operation. Userspace (and the kernel) gets dma_fence B out of this
- because unfortunate reasons, the same rendering on device B also
needs a userptr buffer, which means that dma_fence B is also the one that the mmu_range_notifier needs to wait on before it can tell core mm that it can go ahead and release those pages
I was afraid you'd say this - this is complete madness for other DMA devices to borrow the notifier hook of the first device!
The first device might not even have a notifier. This is the 2nd device, waiting on a dma_fence of its own, but which happens to be queued up as a dma operation behind something else.
What if the first device is a page faulting device and doesn't call dma_fence??
Not sure what you mean with this ... even if it does page-faulting for some other reasons, it'll emit a dma_fence which the 2nd device can consume as a dependency.
At some point the pages under the buffer have to be either pinned or protected by mmu notifier. So each and every single device doing DMA to these pages must either pin, or use mmu notifier.
Driver A should never 'borrow' a notifier from B
If each driver controls its own lifetime of the buffers, why can't the driver locally wait for its device to finish?
Can't the GPUs cancel work that is waiting on a DMA fence? Ie if Driver A detects that work completed and wants to trigger a DMA fence, but it now knows the buffer is invalidated, can't it tell driver B to give up?
The problem is that there's piles of other dependencies for a dma job. GPU doesn't just consume a single buffer each time, it consumes entire lists of buffers and mixes them all up in funny ways. Some of these buffers are userptr, entirely local to the device. Other buffers are just normal device driver allocations (and managed with some shrinker to keep them in check). And then there's the actually shared dma-buf with other devices. The trouble is that they're all bundled up together.
But why does this matter? Does the GPU itself consume some work and then stall internally waiting for an external DMA fence?
Otherwise I would expect this dependency chain should be breakable by aborting work waiting on fences upon invalidation (without stalling)
Do not need to wait on dma_fence in notifiers.
Maybe :-) The goal of this series is more to document current rules and make them more consistent. Fixing them if we don't like them might be a follow-up task, but that would likely be a pile more work. First we need to know what the exact shape of the problem even is.
Fair enough
Jason
On Fri, Jun 19, 2020 at 5:15 PM Jason Gunthorpe jgg@ziepe.ca wrote:
On Fri, Jun 19, 2020 at 05:06:04PM +0200, Daniel Vetter wrote:
On Fri, Jun 19, 2020 at 1:39 PM Jason Gunthorpe jgg@ziepe.ca wrote:
On Fri, Jun 19, 2020 at 09:22:09AM +0200, Daniel Vetter wrote:
As I've understood GPU that means you need to show that the commands associated with the buffer have completed. This is all local stuff within the driver, right? Why use fence (other than it already exists)
Because that's the end-of-dma thing. And it's cross-driver for the above reasons, e.g.
- device A renders some stuff. Userspace gets dma_fence A out of that
(well sync_file or one of the other uapi interfaces, but you get the idea)
- userspace (across process or just different driver) issues more
rendering for device B, which depends upon the rendering done on device A. So dma_fence A is a dependency and will block this dma operation. Userspace (and the kernel) gets dma_fence B out of this
- because unfortunate reasons, the same rendering on device B also
needs a userptr buffer, which means that dma_fence B is also the one that the mmu_range_notifier needs to wait on before it can tell core mm that it can go ahead and release those pages
I was afraid you'd say this - this is complete madness for other DMA devices to borrow the notifier hook of the first device!
The first device might not even have a notifier. This is the 2nd device, waiting on a dma_fence of its own, but which happens to be queued up as a dma operation behind something else.
What if the first device is a page faulting device and doesn't call dma_fence??
Not sure what you mean with this ... even if it does page-faulting for some other reasons, it'll emit a dma_fence which the 2nd device can consume as a dependency.
At some point the pages under the buffer have to be either pinned or protected by mmu notifier. So each and every single device doing DMA to these pages must either pin, or use mmu notifier.
Driver A should never 'borrow' a notifier from B
It doesn't. I guess this would be a great topic for lpc with a seriously big white-board, but I guess we don't have that this year again, so let me try again. Simplified example ofc, but it should be the gist.
Ingredients:
Device A and Device B
A dma-buf, shared between device A and device B, let's call that shared_buf
A userptr buffer, which userspace created on device B to hopefully somewhat track a virtual memory range, let's call that userptr_buf.
A pile of other buffers, but we pretend they don't exist (because they kinda don't matter).
Sequence of events as userspace issues them to the kernel:
1. dma operation on device A, which fills some interesting stuff into shared_buf. Userspace gets back a handle to dma_fence fence_A. No mmu notifier anywhere to be seen in the driver for device A.
2. userspace passes fence_A around to some other place
3. another place takes the handle for shared_buf and fence_A and userptr_buf and starts a dma operation on device B. It's one dma operation, maybe device B is taking the data from shared_buf and compressing it into userptr_buf, so that userspace can then send it over the network or to disk or whatever. device B has a mmu_notifier. Userspace gets back fence_B, which represents this dma operation. The kernel also stuffs this fence_B into the mmu_range_notifier for userptr_buf.
-> at this point device A might still be crunching the numbers
4. device A is finally done doing whatever it was supposed to do, and fence_A completes
5. device B wakes up (this might or might not involve the kernel, usually it does) since fence_A has completed, and now starts doing its own crunching.
6. once device B is also done, it signals fence_B
In all this device A has never borrowed the mmu notifier or even accessed the memory in userptr_buf or had access to that buffer handle.
The madness is only that device B's mmu notifier might need to wait for fence_B so that the dma operation finishes. Which in turn has to wait for device A to finish first.
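Concretely the notifier on device B ends up looking roughly like this sketch (modelled on the existing userptr code in amdgpu, the my_* names are made up):

#include <linux/dma-resv.h>
#include <linux/mmu_notifier.h>
#include <linux/sched.h>

struct my_userptr_bo {
	struct mmu_interval_notifier notifier;
	struct dma_resv *resv;
};

static bool my_userptr_invalidate(struct mmu_interval_notifier *mni,
				  const struct mmu_notifier_range *range,
				  unsigned long cur_seq)
{
	struct my_userptr_bo *bo = container_of(mni, struct my_userptr_bo, notifier);
	long r;

	if (!mmu_notifier_range_blockable(range))
		return false;

	mmu_interval_set_seq(mni, cur_seq);

	/*
	 * This is the wait that causes the trouble: fence_B sits in bo's
	 * reservation object, and fence_B only signals once fence_A from
	 * device A has signalled.
	 */
	r = dma_resv_wait_timeout_rcu(bo->resv, true, false,
				      MAX_SCHEDULE_TIMEOUT);
	if (r <= 0)
		pr_err("waiting for userptr fences failed\n");

	return true;
}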
If each driver controls its own lifetime of the buffers, why can't the driver locally wait for its device to finish?
Can't the GPUs cancel work that is waiting on a DMA fence? Ie if Driver A detects that work completed and wants to trigger a DMA fence, but it now knows the buffer is invalidated, can't it tell driver B to give up?
We can (usually, the shitty hw where we can't has generally disappeared) with gpu reset. Users make really sad faces when that happens though, and generally they're only ok with that if it's indeed a nasty gpu program that resulted in the crash (there's some webgl shaders that run too long for quick&easy testing of how good the gpu reset is, don't do that if you care about the data in your desktop session ...).
The trouble is that userspace assembles the work that's queued up on the gpu. After submission everyone has forgotten enough that just canceling stuff and re-issuing everything isn't on the table.
Some hw is better, with real hw page faults and stuff, but those also don't need dma_fence to track their memory. But generally just not possible.
The problem is that there's piles of other dependencies for a dma job. GPU doesn't just consume a single buffer each time, it consumes entire lists of buffers and mixes them all up in funny ways. Some of these buffers are userptr, entirely local to the device. Other buffers are just normal device driver allocations (and managed with some shrinker to keep them in check). And then there's the actually shared dma-buf with other devices. The trouble is that they're all bundled up together.
But why does this matter? Does the GPU itself consume some work and then stall internally waiting for an external DMA fence?
Yup, see above, that's what's going on. Userspace queues up distributed work across engines & drivers, and then just waits for the entire thing to cascade and finish.
Otherwise I would expect this dependency chain should be breakable by aborting work waiting on fences upon invalidation (without stalling)
Yup, it would. Now on some hw you have a gpu work scheduler that sits in some kthread, and you could probably unschedule the work if there's some external dependency and you get an mmu notifier callback. Then put it on some queue, re-acquire the user pages and then reschedule it.
It's still as horrible, since you still have the wait for the completion in there, the only benefit is that other device drivers without userptr support don't have to live with that specific constraint. dma_fence rules are still very strict and easy to deadlock, so we'd still want some lockdep checks, but now you'd have to somehow annotate whether you're a driver with userptr or a driver without userptr and make sure everyone gets it right.
Also a scheduler which can unschedule and reschedule is mighty more complex than one which cannot, plus it needs to do that from mmu notifier callback (not the nicest calling context we have in the kernel by far). And if you have a single driver which doesn't unschedule, you're still screwed from an overall subsystem pov.
So lots of code, lots of work, and not that much motivation to roll it out consistently across the board since there's no incremental payoff. Plus the thing is, the drivers without userptr are generally the really simple ones. Much easier to just fix those than to change the big complex render beasts which want userptr :-)
E.g. the atomic modeset framework we've rolled out in the past few years and that almost all display drivers now use pulls any (sleeping) locks and memory allocations out of the critical async work section by design. Some drivers still managed to butcher it (the annotations caught some locking bugs already, not just memory allocations in the wrong spot), but generally easy to fix those.
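As an illustration of that design (not actual helper code, the sketch_*() names and the sketch_commit type are invented), the commit flow conceptually splits like this:

/* Conceptual sketch only. The point is the split: anything that sleeps on
 * locks or allocates memory runs before the asynchronous commit work is
 * queued, and the async work that eventually signals the out-fences does
 * neither. */
static int sketch_atomic_commit(struct sketch_commit *commit)
{
	int err;

	/* Allowed here: GFP_KERNEL allocations, dma_resv_lock, userspace
	 * faults - nothing is waiting on our fences yet. */
	err = sketch_prepare_and_allocate(commit);
	if (err)
		return err;

	/* Hand off to the async worker; from here on we're committed. */
	queue_work(system_unbound_wq, &commit->work);
	return 0;
}

static void sketch_commit_tail(struct work_struct *work)
{
	struct sketch_commit *commit = container_of(work, typeof(*commit), work);

	/* Fence-critical section: only hardware programming and fence
	 * signalling, no allocations, no sleeping locks. */
	sketch_program_hardware(commit);
	sketch_signal_out_fences(commit);
}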
Do not need to wait on dma_fence in notifiers.
Maybe :-) The goal of this series is more to document current rules and make them more consistent. Fixing them if we don't like them might be a follow-up task, but that would likely be a pile more work. First we need to know what the exact shape of the problem even is.
Fair enough
Full disclosure: We are aware that we've designed ourselves into an impressive corner here, and there's lots of talks going on about untangling the dma synchronization from the memory management completely. But
- that needs minimally reliable preempt support for gpu work, and hw engineers seem to have a hard time with that (or just don't want to do it). hw page faults would be even better, and even more wishlist than reality if you expect it to work everywhere.
- it'd be a complete break of the established userspace abi, including all the cross driver stuff. Which means it's not just some in-kernel refactoring, we need to rev the entire ecosystem. And that takes a very long time, and needs serious pressure to get people moving.
E.g. the atomic modeset rework is still not yet rolled out to major linux desktop environments, and it's over 5 years old, and it's starting to seriously hurt because lots of performance features require atomic modeset in userspace to be able to use them. I think rev'ing the entire memory management support will take as long. Plus I don't think we can ditch the old ways - even if all the hw currently using this would be dead (and we can delete the drivers) there's still the much smaller gpus in SoC that also need to go through the entire evolution. -Daniel
On Fri, Jun 19, 2020 at 06:19:41PM +0200, Daniel Vetter wrote:
The madness is only that device B's mmu notifier might need to wait for fence_B so that the dma operation finishes. Which in turn has to wait for device A to finish first.
So, it sounds like, fundamentally, you've got this graph of operations across an unknown set of drivers and the kernel cannot insert itself in dma_fence hand offs to re-validate any of the buffers involved? Buffers which by definition cannot be touched by the hardware yet.
That really is a pretty horrible place to end up..
Pinning really is the right answer for this kind of work flow. I think converting pinning to notifiers should not be done unless notifier invalidation is relatively bounded.
I know people like notifiers because they give a bit nicer performance in some happy cases, but this cripples all the bad cases..
If pinning doesn't work for some reason maybe we should address that?
Full disclosure: We are aware that we've designed ourselves into an impressive corner here, and there's lots of talks going on about untangling the dma synchronization from the memory management completely. But
I think the documenting is really important: only GPU should be using this stuff and driving notifiers this way. Complete NO for any totally-not-a-GPU things in drivers/accel for sure.
Jason
On Fri, Jun 19, 2020 at 02:23:08PM -0300, Jason Gunthorpe wrote:
On Fri, Jun 19, 2020 at 06:19:41PM +0200, Daniel Vetter wrote:
The madness is only that device B's mmu notifier might need to wait for fence_B so that the dma operation finishes. Which in turn has to wait for device A to finish first.
So, it sound, fundamentally you've got this graph of operations across an unknown set of drivers and the kernel cannot insert itself in dma_fence hand offs to re-validate any of the buffers involved? Buffers which by definition cannot be touched by the hardware yet.
That really is a pretty horrible place to end up..
Pinning really is right answer for this kind of work flow. I think converting pinning to notifers should not be done unless notifier invalidation is relatively bounded.
I know people like notifiers because they give a bit nicer performance in some happy cases, but this cripples all the bad cases..
If pinning doesn't work for some reason maybe we should address that?
Note that the dma fence is only true for user ptr buffers, which predate any HMM work and thus were using mmu notifiers already. You need the mmu notifier there because of fork and other corner cases.
For nouveau the notifier does not need to wait for anything, it can update the GPU page table right away. Modulo needing to write to GPU memory using the dma engine if the GPU page table is in GPU memory that is not accessible from the CPU, but that's never the case for nouveau so far (though i expect it will be at one point).
So i see this as 2 different cases, the user ptr case, which does pin pages by the way, where things are synchronous. Versus the HMM cases where everything is asynchronous.
I probably need to warn AMD folks again that using HMM means that you must be able to update the GPU page table asynchronously without fence wait. The issue for AMD is that they already update their GPU page table using DMA engine. I believe this is still doable if they use a kernel only DMA engine context, where only kernel can queue up jobs so that you do not need to wait for unrelated things and you can prioritize GPU page table update which should translate in fast GPU page table update without DMA fence.
Full disclosure: We are aware that we've designed ourselves into an impressive corner here, and there's lots of talks going on about untangling the dma synchronization from the memory management completely. But
I think the documenting is really important: only GPU should be using this stuff and driving notifiers this way. Complete NO for any totally-not-a-GPU things in drivers/accel for sure.
Yes, for users that expect HMM they need to be asynchronous. But it is hard to revert user ptr as it was done a long time ago.
Cheers, Jérôme
On Fri, Jun 19, 2020 at 02:09:35PM -0400, Jerome Glisse wrote:
On Fri, Jun 19, 2020 at 02:23:08PM -0300, Jason Gunthorpe wrote:
On Fri, Jun 19, 2020 at 06:19:41PM +0200, Daniel Vetter wrote:
The madness is only that device B's mmu notifier might need to wait for fence_B so that the dma operation finishes. Which in turn has to wait for device A to finish first.
So, it sound, fundamentally you've got this graph of operations across an unknown set of drivers and the kernel cannot insert itself in dma_fence hand offs to re-validate any of the buffers involved? Buffers which by definition cannot be touched by the hardware yet.
That really is a pretty horrible place to end up..
Pinning really is right answer for this kind of work flow. I think converting pinning to notifers should not be done unless notifier invalidation is relatively bounded.
I know people like notifiers because they give a bit nicer performance in some happy cases, but this cripples all the bad cases..
If pinning doesn't work for some reason maybe we should address that?
Note that the dma fence is only true for user ptr buffer which predate any HMM work and thus were using mmu notifier already. You need the mmu notifier there because of fork and other corner cases.
I wonder if we should try to fix the fork case more directly - RDMA has this same problem and added MADV_DONTFORK a long time ago as a hacky way to deal with it.
Some crazy page pin that resolved COW in a way that always kept the physical memory with the mm that initiated the pin?
(isn't this broken for O_DIRECT as well anyhow?)
How does mmu_notifiers help the fork case anyhow? Block fork from progressing?
I probably need to warn AMD folks again that using HMM means that you must be able to update the GPU page table asynchronously without fence wait.
It is kind of unrelated to HMM, it just shouldn't be using mmu notifiers to replace page pinning..
The issue for AMD is that they already update their GPU page table using DMA engine. I believe this is still doable if they use a kernel only DMA engine context, where only kernel can queue up jobs so that you do not need to wait for unrelated things and you can prioritize GPU page table update which should translate in fast GPU page table update without DMA fence.
Make sense
I'm not sure I saw this in the AMD hmm stuff - it would be good if someone would look at that. Every time I do it looks like the locking is wrong.
Jason
Am 2020-06-19 um 2:18 p.m. schrieb Jason Gunthorpe:
On Fri, Jun 19, 2020 at 02:09:35PM -0400, Jerome Glisse wrote:
On Fri, Jun 19, 2020 at 02:23:08PM -0300, Jason Gunthorpe wrote:
On Fri, Jun 19, 2020 at 06:19:41PM +0200, Daniel Vetter wrote:
The madness is only that device B's mmu notifier might need to wait for fence_B so that the dma operation finishes. Which in turn has to wait for device A to finish first.
So, it sound, fundamentally you've got this graph of operations across an unknown set of drivers and the kernel cannot insert itself in dma_fence hand offs to re-validate any of the buffers involved? Buffers which by definition cannot be touched by the hardware yet.
That really is a pretty horrible place to end up..
Pinning really is right answer for this kind of work flow. I think converting pinning to notifers should not be done unless notifier invalidation is relatively bounded.
I know people like notifiers because they give a bit nicer performance in some happy cases, but this cripples all the bad cases..
If pinning doesn't work for some reason maybe we should address that?
Note that the dma fence is only true for user ptr buffer which predate any HMM work and thus were using mmu notifier already. You need the mmu notifier there because of fork and other corner cases.
I wonder if we should try to fix the fork case more directly - RDMA has this same problem and added MADV_DONTFORK a long time ago as a hacky way to deal with it.
Some crazy page pin that resolved COW in a way that always kept the physical memory with the mm that initiated the pin?
(isn't this broken for O_DIRECT as well anyhow?)
How does mmu_notifiers help the fork case anyhow? Block fork from progressing?
How much the mmu_notifier blocks fork progress depends on how quickly we can preempt GPU jobs accessing affected memory. If we don't have fine-grained preemption capability (graphics), the best we can do is wait for the GPU jobs to complete. We can also delay submission of new GPU jobs to the same memory until the MMU notifier is done. Future jobs would use the new page addresses.
With fine-grained preemption (ROCm compute), we can preempt GPU work on the affected address space to minimize the delay seen by fork.
With recoverable device page faults, we can invalidate GPU page table entries, so device access to the affected pages stops immediately.
In all cases, the end result is that the device page table gets updated with the address of the copied pages before the GPU accesses the COW memory again. Without the MMU notifier, we'd end up with the GPU corrupting memory of the other process.
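A rough sketch of the first option (wait for the jobs to complete), similar in spirit to what userptr handling in GPU drivers does today; the sketch_bo type and its members are invented, the mmu_interval_notifier and dma_resv calls are the real APIs:

#include <linux/dma-resv.h>
#include <linux/mmu_notifier.h>

static bool sketch_userptr_invalidate(struct mmu_interval_notifier *mni,
				      const struct mmu_notifier_range *range,
				      unsigned long cur_seq)
{
	struct sketch_bo *bo = container_of(mni, struct sketch_bo, notifier);
	long r;

	if (!mmu_notifier_range_blockable(range))
		return false;

	/* Mark the range invalid so the next submission re-acquires the pages. */
	mmu_interval_set_seq(mni, cur_seq);

	/* Wait for all GPU jobs still using the old pages. This is the wait
	 * that can transitively depend on other devices' fences. */
	r = dma_resv_wait_timeout_rcu(bo->resv, true, false,
				      MAX_SCHEDULE_TIMEOUT);
	if (r <= 0)
		pr_debug("userptr fence wait failed (%ld)\n", r);

	return true;
}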
Regards, Felix
I probably need to warn AMD folks again that using HMM means that you must be able to update the GPU page table asynchronously without fence wait.
It is kind of unrelated to HMM, it just shouldn't be using mmu notifiers to replace page pinning..
The issue for AMD is that they already update their GPU page table using DMA engine. I believe this is still doable if they use a kernel only DMA engine context, where only kernel can queue up jobs so that you do not need to wait for unrelated things and you can prioritize GPU page table update which should translate in fast GPU page table update without DMA fence.
Make sense
I'm not sure I saw this in the AMD hmm stuff - it would be good if someone would look at that. Every time I do it looks like the locking is wrong.
Jason
On Fri, Jun 19, 2020 at 03:48:49PM -0400, Felix Kuehling wrote:
Am 2020-06-19 um 2:18 p.m. schrieb Jason Gunthorpe:
On Fri, Jun 19, 2020 at 02:09:35PM -0400, Jerome Glisse wrote:
On Fri, Jun 19, 2020 at 02:23:08PM -0300, Jason Gunthorpe wrote:
On Fri, Jun 19, 2020 at 06:19:41PM +0200, Daniel Vetter wrote:
The madness is only that device B's mmu notifier might need to wait for fence_B so that the dma operation finishes. Which in turn has to wait for device A to finish first.
So, it sound, fundamentally you've got this graph of operations across an unknown set of drivers and the kernel cannot insert itself in dma_fence hand offs to re-validate any of the buffers involved? Buffers which by definition cannot be touched by the hardware yet.
That really is a pretty horrible place to end up..
Pinning really is right answer for this kind of work flow. I think converting pinning to notifers should not be done unless notifier invalidation is relatively bounded.
I know people like notifiers because they give a bit nicer performance in some happy cases, but this cripples all the bad cases..
If pinning doesn't work for some reason maybe we should address that?
Note that the dma fence is only true for user ptr buffer which predate any HMM work and thus were using mmu notifier already. You need the mmu notifier there because of fork and other corner cases.
I wonder if we should try to fix the fork case more directly - RDMA has this same problem and added MADV_DONTFORK a long time ago as a hacky way to deal with it.
Some crazy page pin that resolved COW in a way that always kept the physical memory with the mm that initiated the pin?
(isn't this broken for O_DIRECT as well anyhow?)
How does mmu_notifiers help the fork case anyhow? Block fork from progressing?
How much the mmu_notifier blocks fork progress depends, on quickly we can preempt GPU jobs accessing affected memory. If we don't have fine-grained preemption capability (graphics), the best we can do is wait for the GPU jobs to complete. We can also delay submission of new GPU jobs to the same memory until the MMU notifier is done. Future jobs would use the new page addresses.
With fine-grained preemption (ROCm compute), we can preempt GPU work on the affected adders space to minimize the delay seen by fork.
With recoverable device page faults, we can invalidate GPU page table entries, so device access to the affected pages stops immediately.
In all cases, the end result is, that the device page table gets updated with the address of the copied pages before the GPU accesses the COW memory again.Without the MMU notifier, we'd end up with the GPU corrupting memory of the other process.
The model here in fork has been wrong for a long time, and I do wonder how O_DIRECT manages to not be broken too.. I guess the time windows there are too small to get unlucky.
If you have a write pin on a page then it should not be COW'd into the fork'd process but copied with the originating page remaining with the original mm.
I wonder if there is some easy way to achieve that - if that is the main reason to use notifiers then it would be a better solution.
Jason
Am 2020-06-19 um 3:55 p.m. schrieb Jason Gunthorpe:
On Fri, Jun 19, 2020 at 03:48:49PM -0400, Felix Kuehling wrote:
Am 2020-06-19 um 2:18 p.m. schrieb Jason Gunthorpe:
On Fri, Jun 19, 2020 at 02:09:35PM -0400, Jerome Glisse wrote:
On Fri, Jun 19, 2020 at 02:23:08PM -0300, Jason Gunthorpe wrote:
On Fri, Jun 19, 2020 at 06:19:41PM +0200, Daniel Vetter wrote:
The madness is only that device B's mmu notifier might need to wait for fence_B so that the dma operation finishes. Which in turn has to wait for device A to finish first.
So, it sound, fundamentally you've got this graph of operations across an unknown set of drivers and the kernel cannot insert itself in dma_fence hand offs to re-validate any of the buffers involved? Buffers which by definition cannot be touched by the hardware yet.
That really is a pretty horrible place to end up..
Pinning really is right answer for this kind of work flow. I think converting pinning to notifers should not be done unless notifier invalidation is relatively bounded.
I know people like notifiers because they give a bit nicer performance in some happy cases, but this cripples all the bad cases..
If pinning doesn't work for some reason maybe we should address that?
Note that the dma fence is only true for user ptr buffer which predate any HMM work and thus were using mmu notifier already. You need the mmu notifier there because of fork and other corner cases.
I wonder if we should try to fix the fork case more directly - RDMA has this same problem and added MADV_DONTFORK a long time ago as a hacky way to deal with it.
Some crazy page pin that resolved COW in a way that always kept the physical memory with the mm that initiated the pin?
(isn't this broken for O_DIRECT as well anyhow?)
How does mmu_notifiers help the fork case anyhow? Block fork from progressing?
How much the mmu_notifier blocks fork progress depends, on quickly we can preempt GPU jobs accessing affected memory. If we don't have fine-grained preemption capability (graphics), the best we can do is wait for the GPU jobs to complete. We can also delay submission of new GPU jobs to the same memory until the MMU notifier is done. Future jobs would use the new page addresses.
With fine-grained preemption (ROCm compute), we can preempt GPU work on the affected adders space to minimize the delay seen by fork.
With recoverable device page faults, we can invalidate GPU page table entries, so device access to the affected pages stops immediately.
In all cases, the end result is, that the device page table gets updated with the address of the copied pages before the GPU accesses the COW memory again.Without the MMU notifier, we'd end up with the GPU corrupting memory of the other process.
The model here in fork has been wrong for a long time, and I do wonder how O_DIRECT manages to not be broken too.. I guess the time windows there are too small to get unlucky.
If you have a write pin on a page then it should not be COW'd into the fork'd process but copied with the originating page remaining with the original mm.
I wonder if there is some easy way to achive that - if that is the main reason to use notifiers then it would be a better solution.
Other than the application changing its own virtual address mappings (mprotect, munmap, etc.), triggering MMU notifiers, we also get MMU notifiers from THP worker threads, and NUMA balancing.
When we start doing migration to DEVICE_PRIVATE memory with HMM, we also get MMU notifiers during those driver-initiated migrations.
Regards, Felix
Jason
On Fri, Jun 19, 2020 at 04:55:38PM -0300, Jason Gunthorpe wrote:
On Fri, Jun 19, 2020 at 03:48:49PM -0400, Felix Kuehling wrote:
Am 2020-06-19 um 2:18 p.m. schrieb Jason Gunthorpe:
On Fri, Jun 19, 2020 at 02:09:35PM -0400, Jerome Glisse wrote:
On Fri, Jun 19, 2020 at 02:23:08PM -0300, Jason Gunthorpe wrote:
On Fri, Jun 19, 2020 at 06:19:41PM +0200, Daniel Vetter wrote:
The madness is only that device B's mmu notifier might need to wait for fence_B so that the dma operation finishes. Which in turn has to wait for device A to finish first.
So, it sound, fundamentally you've got this graph of operations across an unknown set of drivers and the kernel cannot insert itself in dma_fence hand offs to re-validate any of the buffers involved? Buffers which by definition cannot be touched by the hardware yet.
That really is a pretty horrible place to end up..
Pinning really is right answer for this kind of work flow. I think converting pinning to notifers should not be done unless notifier invalidation is relatively bounded.
I know people like notifiers because they give a bit nicer performance in some happy cases, but this cripples all the bad cases..
If pinning doesn't work for some reason maybe we should address that?
Note that the dma fence is only true for user ptr buffer which predate any HMM work and thus were using mmu notifier already. You need the mmu notifier there because of fork and other corner cases.
I wonder if we should try to fix the fork case more directly - RDMA has this same problem and added MADV_DONTFORK a long time ago as a hacky way to deal with it.
Some crazy page pin that resolved COW in a way that always kept the physical memory with the mm that initiated the pin?
(isn't this broken for O_DIRECT as well anyhow?)
How does mmu_notifiers help the fork case anyhow? Block fork from progressing?
How much the mmu_notifier blocks fork progress depends, on quickly we can preempt GPU jobs accessing affected memory. If we don't have fine-grained preemption capability (graphics), the best we can do is wait for the GPU jobs to complete. We can also delay submission of new GPU jobs to the same memory until the MMU notifier is done. Future jobs would use the new page addresses.
With fine-grained preemption (ROCm compute), we can preempt GPU work on the affected adders space to minimize the delay seen by fork.
With recoverable device page faults, we can invalidate GPU page table entries, so device access to the affected pages stops immediately.
In all cases, the end result is, that the device page table gets updated with the address of the copied pages before the GPU accesses the COW memory again.Without the MMU notifier, we'd end up with the GPU corrupting memory of the other process.
The model here in fork has been wrong for a long time, and I do wonder how O_DIRECT manages to not be broken too.. I guess the time windows there are too small to get unlucky.
This was discussed extensively in the GUP work John has been doing. Yes O_DIRECT can potentially break, but only if you are writing to COW pages and you initiated the O_DIRECT right before the fork, and GUP happened before fork was able to write protect the pages.
If you O_DIRECT but use the memory as input, ie you are writing the memory to the file not reading from the file, then fork is harmless as you are just reading memory. You can still face the COW uncertainty (the process against which you did the O_DIRECT gets "new" pages but your O_DIRECT goes on with the "old" pages), but doing O_DIRECT and fork concurrently is asking for trouble.
If you have a write pin on a page then it should not be COW'd into the fork'd process but copied with the originating page remaining with the original mm.
I wonder if there is some easy way to achive that - if that is the main reason to use notifiers then it would be a better solution.
Not doable as the page refcount can change for things unrelated to GUP. With John's changes we can identify GUP and we could potentially copy the GUPed page instead of COW, but this can potentially slow down fork() and i am not sure how acceptable this would be. Also this does not solve GUP against pages that are already in the fork tree, ie page P0 is in process A which forks, we now have page P0 in process A and B. Now we have process A which forks again and we have page P0 in A, B, and C. Here B and C are two branches with root in A. B and/or C can keep forking and grow the fork tree.
Now if a read only GUP on P0 happens in C (or B, everything is symmetrical with respect to root A) then P0 might not be the page that is in C after the GUP, ie if something in C writes to the virtual address corresponding to P0 then a new page might get allocated and the virtual address will no longer point to P0 for C.
Semantics were changed with 17839856fd588f4ab6b789f482ed3ffd7c403e1f to somewhat "fix" that, but GUP fast is still susceptible to this.
Note that the above commit only addresses the GUP after/while forking. GUP before fork() needs the mmu notifier (or forcing page copy instead of COW).
Cheers, Jérôme
On Fri, Jun 19, 2020 at 04:31:47PM -0400, Jerome Glisse wrote:
Not doable as page refcount can change for things unrelated to GUP, with John changes we can identify GUP and we could potentialy copy GUPed page instead of COW but this can potentialy slow down fork() and i am not sure how acceptable this would be. Also this does not solve GUP against page that are already in fork tree ie page P0 is in process A which forks, we now have page P0 in process A and B. Now we have process A which forks again and we have page P0 in A, B, and C. Here B and C are two branches with root in A. B and/or C can keep forking and grow the fork tree.
For a long time now RDMA has broken COW pages when creating user DMA regions.
The problem has been that fork re-COW's regions that had their COW broken.
So, if you break the COW upon mapping and prevent fork (and others) from copying DMA pinned pages then you'd cover the cases.
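For illustration, a sketch of the "break the COW upon mapping" half using the pin_user_pages() API (locking and error handling simplified, the function name is invented; the part still missing, as discussed below, is stopping a later fork() from re-COW'ing the pinned range):

#include <linux/mm.h>

static long sketch_pin_user_range(unsigned long start, unsigned long npages,
				  struct page **pages)
{
	long pinned;

	/* FOLL_WRITE forces any COW pages in the range to be copied into this
	 * mm before we get the physical pages; FOLL_LONGTERM marks this as a
	 * long-lived DMA pin. */
	mmap_read_lock(current->mm);
	pinned = pin_user_pages(start, npages,
				FOLL_WRITE | FOLL_LONGTERM, pages, NULL);
	mmap_read_unlock(current->mm);

	return pinned;	/* released later with unpin_user_pages(pages, pinned) */
}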
Semantic was change with 17839856fd588f4ab6b789f482ed3ffd7c403e1f to some what "fix" that but GUP fast is still succeptible to this.
Ah, so everyone breaks the COW now, not just RDMA..
What do you mean 'GUP fast is still succeptible to this' ?
Jason
On Mon, Jun 22, 2020 at 08:46:17AM -0300, Jason Gunthorpe wrote:
On Fri, Jun 19, 2020 at 04:31:47PM -0400, Jerome Glisse wrote:
Not doable as page refcount can change for things unrelated to GUP, with John changes we can identify GUP and we could potentialy copy GUPed page instead of COW but this can potentialy slow down fork() and i am not sure how acceptable this would be. Also this does not solve GUP against page that are already in fork tree ie page P0 is in process A which forks, we now have page P0 in process A and B. Now we have process A which forks again and we have page P0 in A, B, and C. Here B and C are two branches with root in A. B and/or C can keep forking and grow the fork tree.
For a long time now RDMA has broken COW pages when creating user DMA regions.
The problem has been that fork re-COW's regions that had their COW broken.
So, if you break the COW upon mapping and prevent fork (and others) from copying DMA pinned then you'd cover the cases.
I am not sure we want to prevent COW for pinned GUP pages, this would change current semantics and potentially break/slow down existing apps.
Anyway i think we focus too much on fork/COW, it is just an unfixable broken corner case, mmu notifier allows you to avoid it. Forcing a real copy on fork would likely be seen as a regression by most people.
Semantic was change with 17839856fd588f4ab6b789f482ed3ffd7c403e1f to some what "fix" that but GUP fast is still succeptible to this.
Ah, so everyone breaks the COW now, not just RDMA..
What do you mean 'GUP fast is still succeptible to this' ?
Not all GUP fast paths are updated (intentionally), __get_user_pages_fast() for instance still keeps COW intact. People using GUP should really know what they are doing.
Cheers, Jérôme
On Mon, Jun 22, 2020 at 04:15:40PM -0400, Jerome Glisse wrote:
On Mon, Jun 22, 2020 at 08:46:17AM -0300, Jason Gunthorpe wrote:
On Fri, Jun 19, 2020 at 04:31:47PM -0400, Jerome Glisse wrote:
Not doable as page refcount can change for things unrelated to GUP, with John changes we can identify GUP and we could potentialy copy GUPed page instead of COW but this can potentialy slow down fork() and i am not sure how acceptable this would be. Also this does not solve GUP against page that are already in fork tree ie page P0 is in process A which forks, we now have page P0 in process A and B. Now we have process A which forks again and we have page P0 in A, B, and C. Here B and C are two branches with root in A. B and/or C can keep forking and grow the fork tree.
For a long time now RDMA has broken COW pages when creating user DMA regions.
The problem has been that fork re-COW's regions that had their COW broken.
So, if you break the COW upon mapping and prevent fork (and others) from copying DMA pinned then you'd cover the cases.
I am not sure we want to prevent COW for pinned GUP pages, this would change current semantic and potentialy break/slow down existing apps.
Isn't that basically exactly what 17839856fd588 does? It looks like it uses the same approach RDMA does by sticking FOLL_WRITE even though it is a read action.
After that change the remaining bug is that fork can re-establish a COW.
Anyway i think we focus too much on fork/COW, it is just an unfixable broken corner cases, mmu notifier allows you to avoid it. Forcing real copy on fork would likely be seen as regression by most people.
If you don't copy then there are data corruption bugs though. Real apps probably don't hit a problem here as they are not forking while GUP's are active (RDMA excluded, which does do this)
I don't think that implementing page pinning by blocking mmu notifiers for the duration of the pin is a particularly good idea either, that actually seems a lot worse than just having the pin in the first place.
Particularly if it is only being done to avoid corner case bugs that already afflict other GUP cases :(
What do you mean 'GUP fast is still succeptible to this' ?
Not all GUP fast path are updated (intentionaly) __get_user_pages_fast()
Sure, that is the 'raw' accessor
Jason
On Fri, Jun 19, 2020 at 03:18:49PM -0300, Jason Gunthorpe wrote:
On Fri, Jun 19, 2020 at 02:09:35PM -0400, Jerome Glisse wrote:
On Fri, Jun 19, 2020 at 02:23:08PM -0300, Jason Gunthorpe wrote:
On Fri, Jun 19, 2020 at 06:19:41PM +0200, Daniel Vetter wrote:
The madness is only that device B's mmu notifier might need to wait for fence_B so that the dma operation finishes. Which in turn has to wait for device A to finish first.
So, it sound, fundamentally you've got this graph of operations across an unknown set of drivers and the kernel cannot insert itself in dma_fence hand offs to re-validate any of the buffers involved? Buffers which by definition cannot be touched by the hardware yet.
That really is a pretty horrible place to end up..
Pinning really is right answer for this kind of work flow. I think converting pinning to notifers should not be done unless notifier invalidation is relatively bounded.
I know people like notifiers because they give a bit nicer performance in some happy cases, but this cripples all the bad cases..
If pinning doesn't work for some reason maybe we should address that?
Note that the dma fence is only true for user ptr buffer which predate any HMM work and thus were using mmu notifier already. You need the mmu notifier there because of fork and other corner cases.
I wonder if we should try to fix the fork case more directly - RDMA has this same problem and added MADV_DONTFORK a long time ago as a hacky way to deal with it.
Some crazy page pin that resolved COW in a way that always kept the physical memory with the mm that initiated the pin?
Just no way to deal with it easily. I thought about forcing the anon_vma (page->mapping for anonymous pages) to the anon_vma that belongs to the vma against which the GUP was done, but it would break things if the page is already in another branch of a fork tree. Also this forbids fast GUP.
Quite frankly the fork was not the main motivating factor. GPUs can pin potentially GBytes of memory, thus we wanted to be able to release it, but since Michal's changes to the reclaim code this is no longer effective.
User buffers should never end up in those weird corner cases, iirc the first usage was for xorg exa texture upload, then generalized to texture upload in mesa and later on to more upload cases (vertices, ...). At least this is what i remember today. So in those cases we do not expect fork, splice, mremap, mprotect, ...
Maybe we can audit how user ptr buffers are used today and see if we can define a usage pattern that would allow to cut corners in the kernel. For instance we could use the mmu notifier just to block CPU pte updates while we do GUP and thus never wait on a dma fence.
Then the GPU driver just keeps the GUP pin around until it is done with the pages. It can also use the mmu notifier to keep a flag so that the driver knows if it needs to redo a GUP, ie:
The notifier path:
    GPU_mmu_notifier_start_callback(range)
        gpu_lock_cpu_pagetable(range)
        for_each_bo_in(bo, range) {
            bo->need_gup = true;
        }
        gpu_unlock_cpu_pagetable(range)

    GPU_validate_buffer_pages(bo)
        if (!bo->need_gup)
            return;
        put_pages(bo->pages);
        range = bo_vaddr_range(bo)
        gpu_lock_cpu_pagetable(range)
        GUP(bo->pages, range)
        gpu_unlock_cpu_pagetable(range)
Depending on how user_ptr is used today this could work.
(isn't this broken for O_DIRECT as well anyhow?)
Yes it can in theory, if you have an application that does O_DIRECT and fork concurrently (ie O_DIRECT in one thread and fork in another). Note that O_DIRECT after fork is fine, it is an issue only if GUP_fast was able to lookup a page with write permission before fork had the chance to update it to read only for COW.
But doing O_DIRECT (or anything that uses GUP fast) in one thread and fork in another is inherently broken, ie there is no way to fix it.
See 17839856fd588f4ab6b789f482ed3ffd7c403e1f
How does mmu_notifiers help the fork case anyhow? Block fork from progressing?
It enforces ordering between fork and GUP: if fork is first it blocks GUP, and if fork is last then fork waits on GUP and then the user buffer gets invalidated.
I probably need to warn AMD folks again that using HMM means that you must be able to update the GPU page table asynchronously without fence wait.
It is kind of unrelated to HMM, it just shouldn't be using mmu notifiers to replace page pinning..
Well my POV is that if you abide by the rules HMM defined then you do not need to pin pages. The rule is asynchronous device page table update.
Pinning pages is problematic, it blocks many core mm features and it is just bad all around. Also it is inherently broken in front of fork/mremap/splice/...
Cheers, Jérôme
On Fri, Jun 19, 2020 at 10:10 PM Jerome Glisse jglisse@redhat.com wrote:
On Fri, Jun 19, 2020 at 03:18:49PM -0300, Jason Gunthorpe wrote:
On Fri, Jun 19, 2020 at 02:09:35PM -0400, Jerome Glisse wrote:
On Fri, Jun 19, 2020 at 02:23:08PM -0300, Jason Gunthorpe wrote:
On Fri, Jun 19, 2020 at 06:19:41PM +0200, Daniel Vetter wrote:
The madness is only that device B's mmu notifier might need to wait for fence_B so that the dma operation finishes. Which in turn has to wait for device A to finish first.
So, it sound, fundamentally you've got this graph of operations across an unknown set of drivers and the kernel cannot insert itself in dma_fence hand offs to re-validate any of the buffers involved? Buffers which by definition cannot be touched by the hardware yet.
That really is a pretty horrible place to end up..
Pinning really is right answer for this kind of work flow. I think converting pinning to notifers should not be done unless notifier invalidation is relatively bounded.
I know people like notifiers because they give a bit nicer performance in some happy cases, but this cripples all the bad cases..
If pinning doesn't work for some reason maybe we should address that?
Note that the dma fence is only true for user ptr buffer which predate any HMM work and thus were using mmu notifier already. You need the mmu notifier there because of fork and other corner cases.
I wonder if we should try to fix the fork case more directly - RDMA has this same problem and added MADV_DONTFORK a long time ago as a hacky way to deal with it.
Some crazy page pin that resolved COW in a way that always kept the physical memory with the mm that initiated the pin?
Just no way to deal with it easily, i thought about forcing the anon_vma (page->mapping for anonymous page) to the anon_vma that belongs to the vma against which the GUP was done but it would break things if page is already in other branch of a fork tree. Also this forbid fast GUP.
Quite frankly the fork was not the main motivating factor. GPU can pin potentialy GBytes of memory thus we wanted to be able to release it but since Michal changes to reclaim code this is no longer effective.
What where how? My patch to annotate reclaim paths with mmu notifier possibility just landed in -mm, so if direct reclaim can't reclaim mmu notifier'ed stuff anymore we need to know.
Also this would resolve the entire pain we're discussing in this thread about dma_fence_wait deadlocking against anything that's not GFP_ATOMIC ... -Daniel
User buffer should never end up in those weird corner case, iirc the first usage was for xorg exa texture upload, then generalize to texture upload in mesa and latter on to more upload cases (vertices, ...). At least this is what i remember today. So in those cases we do not expect fork, splice, mremap, mprotect, ...
Maybe we can audit how user ptr buffer are use today and see if we can define a usage pattern that would allow to cut corner in kernel. For instance we could use mmu notifier just to block CPU pte update while we do GUP and thus never wait on dma fence.
Then GPU driver just keep the GUP pin around until they are done with the page. They can also use the mmu notifier to keep a flag so that the driver know if it needs to redo a GUP ie:
The notifier path:
    GPU_mmu_notifier_start_callback(range)
        gpu_lock_cpu_pagetable(range)
        for_each_bo_in(bo, range) {
            bo->need_gup = true;
        }
        gpu_unlock_cpu_pagetable(range)

    GPU_validate_buffer_pages(bo)
        if (!bo->need_gup)
            return;
        put_pages(bo->pages);
        range = bo_vaddr_range(bo)
        gpu_lock_cpu_pagetable(range)
        GUP(bo->pages, range)
        gpu_unlock_cpu_pagetable(range)
Depending on how user_ptr are use today this could work.
(isn't this broken for O_DIRECT as well anyhow?)
Yes it can in theory, if you have an application that does O_DIRECT and fork concurrently (ie O_DIRECT in one thread and fork in another). Note that O_DIRECT after fork is fine, it is an issue only if GUP_fast was able to lookup a page with write permission before fork had the chance to update it to read only for COW.
But doing O_DIRECT (or anything that use GUP fast) in one thread and fork in another is inherently broken ie there is no way to fix it.
See 17839856fd588f4ab6b789f482ed3ffd7c403e1f
How does mmu_notifiers help the fork case anyhow? Block fork from progressing?
It enforce ordering between fork and GUP, if fork is first it blocks GUP and if forks is last then fork waits on GUP and then user buffer get invalidated.
I probably need to warn AMD folks again that using HMM means that you must be able to update the GPU page table asynchronously without fence wait.
It is kind of unrelated to HMM, it just shouldn't be using mmu notifiers to replace page pinning..
Well my POV is that if you abide by rules HMM defined then you do not need to pin pages. The rule is asynchronous device page table update.
Pinning pages is problematic it blocks many core mm features and it is just bad all around. Also it is inherently broken in front of fork/mremap/splice/...
Cheers, Jérôme
On Fri, Jun 19, 2020 at 10:43:20PM +0200, Daniel Vetter wrote:
On Fri, Jun 19, 2020 at 10:10 PM Jerome Glisse jglisse@redhat.com wrote:
On Fri, Jun 19, 2020 at 03:18:49PM -0300, Jason Gunthorpe wrote:
On Fri, Jun 19, 2020 at 02:09:35PM -0400, Jerome Glisse wrote:
On Fri, Jun 19, 2020 at 02:23:08PM -0300, Jason Gunthorpe wrote:
On Fri, Jun 19, 2020 at 06:19:41PM +0200, Daniel Vetter wrote:
The madness is only that device B's mmu notifier might need to wait for fence_B so that the dma operation finishes. Which in turn has to wait for device A to finish first.
So, it sound, fundamentally you've got this graph of operations across an unknown set of drivers and the kernel cannot insert itself in dma_fence hand offs to re-validate any of the buffers involved? Buffers which by definition cannot be touched by the hardware yet.
That really is a pretty horrible place to end up..
Pinning really is right answer for this kind of work flow. I think converting pinning to notifers should not be done unless notifier invalidation is relatively bounded.
I know people like notifiers because they give a bit nicer performance in some happy cases, but this cripples all the bad cases..
If pinning doesn't work for some reason maybe we should address that?
Note that the dma fence is only true for user ptr buffer which predate any HMM work and thus were using mmu notifier already. You need the mmu notifier there because of fork and other corner cases.
I wonder if we should try to fix the fork case more directly - RDMA has this same problem and added MADV_DONTFORK a long time ago as a hacky way to deal with it.
Some crazy page pin that resolved COW in a way that always kept the physical memory with the mm that initiated the pin?
Just no way to deal with it easily, i thought about forcing the anon_vma (page->mapping for anonymous page) to the anon_vma that belongs to the vma against which the GUP was done but it would break things if page is already in other branch of a fork tree. Also this forbid fast GUP.
Quite frankly the fork was not the main motivating factor. GPU can pin potentialy GBytes of memory thus we wanted to be able to release it but since Michal changes to reclaim code this is no longer effective.
What where how? My patch to annote reclaim paths with mmu notifier possibility just landed in -mm, so if direct reclaim can't reclaim mmu notifier'ed stuff anymore we need to know.
Also this would resolve the entire pain we're discussing in this thread about dma_fence_wait deadlocking against anything that's not GFP_ATOMIC ...
Sorry, my bad, reclaim still works, only oom skips it. It was a couple of years ago and i thought that some of the things discussed a while back did make it upstream.
It is probably a good time to also point out that what i wanted to do is have all the mmu notifier callbacks provide some kind of fence (not a dma fence) so that we can split the notification into steps:
A- schedule notification on all devices/systems and get fences; this step should minimize lock dependencies and should not have to wait for anything, also best if you can avoid memory allocation, for instance by pre-allocating what you need for the notification.
B- mm can do things like unmap but can not map new pages, so write special swap ptes to the cpu page table
C- wait on each fence from A ... resume old code ie replace ptes or finish the unmap ...
The idea here is that at step C the core mm can decide to back off if any fence returned from A has to wait. This means that every device is invalidating for nothing, but if we get there then it might still be a good thing, as next time around maybe the kernel would be successful without a wait.
This would allow things like reclaim to make forward progress and skip over or limit wait time to given timeout.
Also I thought to extend this even to multi-cpu tlb flushes so that devices and CPUs follow the same pattern and we can make parallel progress on each.
Getting to such a scheme is a lot of work. My plan was to first get the fence as part of the notifier user API and hide it from the mm inside the notifier common code. Then update each core mm path to the new model and see if there is any benefit from it. Reclaim would be the first candidate.
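Purely as illustration of the A/B/C split described above (none of these interfaces exist, every name below is made up):

static int invalidate_range_with_fences(struct mm_struct *mm,
					unsigned long start, unsigned long end)
{
	struct notifier_fence *fences[MAX_NOTIFIER_FENCES];	/* invented */
	int i, n;

	/* A: kick off the invalidation on every device and collect fences.
	 * No waiting, and ideally no memory allocation, in this phase. */
	n = mmu_notifier_invalidate_start_async(mm, start, end, fences);

	/* B: the core mm may unmap and write the special swap ptes, but must
	 * not map new pages into the range yet. */
	unmap_and_write_special_ptes(mm, start, end);

	/* C: wait for every device, or back off (e.g. in reclaim) if a fence
	 * would take too long. */
	for (i = 0; i < n; i++) {
		if (notifier_fence_wait_timeout(fences[i], TIMEOUT) <= 0)
			return -EBUSY;	/* caller may retry or skip */
	}

	/* ... resume the old code: replace the ptes or finish the unmap ... */
	return 0;
}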
Cheers, Jérôme
On Fri, Jun 19, 2020 at 04:10:11PM -0400, Jerome Glisse wrote:
Maybe we can audit how user ptr buffer are use today and see if we can define a usage pattern that would allow to cut corner in kernel. For instance we could use mmu notifier just to block CPU pte update while we do GUP and thus never wait on dma fence.
The DMA fence is the main problem, if you can think of a way to avoid it then it would be great!
Then GPU driver just keep the GUP pin around until they are done with the page. They can also use the mmu notifier to keep a flag so that the driver know if it needs to redo a GUP ie:
The notifier path:
    GPU_mmu_notifier_start_callback(range)
        gpu_lock_cpu_pagetable(range)
        for_each_bo_in(bo, range) {
            bo->need_gup = true;
        }
        gpu_unlock_cpu_pagetable(range)
So some kind of invalidation tracking? But this doesn't solve COW and Fork problem?
It is kind of unrelated to HMM, it just shouldn't be using mmu notifiers to replace page pinning..
Well my POV is that if you abide by rules HMM defined then you do not need to pin pages. The rule is asynchronous device page table update.
I think one of the hmm rules is to not block notifiers for a long time, which this scheme seems to violate already.
Pinning for a long time is less bad than blocking notifiers for a long time, IMHO
Jason
On Fri, Jun 19, 2020 at 2:09 PM Jerome Glisse jglisse@redhat.com wrote:
On Fri, Jun 19, 2020 at 02:23:08PM -0300, Jason Gunthorpe wrote:
On Fri, Jun 19, 2020 at 06:19:41PM +0200, Daniel Vetter wrote:
The madness is only that device B's mmu notifier might need to wait for fence_B so that the dma operation finishes. Which in turn has to wait for device A to finish first.
So, it sound, fundamentally you've got this graph of operations across an unknown set of drivers and the kernel cannot insert itself in dma_fence hand offs to re-validate any of the buffers involved? Buffers which by definition cannot be touched by the hardware yet.
That really is a pretty horrible place to end up..
Pinning really is right answer for this kind of work flow. I think converting pinning to notifers should not be done unless notifier invalidation is relatively bounded.
I know people like notifiers because they give a bit nicer performance in some happy cases, but this cripples all the bad cases..
If pinning doesn't work for some reason maybe we should address that?
Note that the dma fence is only true for user ptr buffer which predate any HMM work and thus were using mmu notifier already. You need the mmu notifier there because of fork and other corner cases.
For nouveau the notifier do not need to wait for anything it can update the GPU page table right away. Modulo needing to write to GPU memory using dma engine if the GPU page table is in GPU memory that is not accessible from the CPU but that's never the case for nouveau so far (but i expect it will be at one point).
So i see this as 2 different cases, the user ptr case, which does pin pages by the way, where things are synchronous. Versus the HMM cases where everything is asynchronous.
I probably need to warn AMD folks again that using HMM means that you must be able to update the GPU page table asynchronously without fence wait. The issue for AMD is that they already update their GPU page table using DMA engine. I believe this is still doable if they use a kernel only DMA engine context, where only kernel can queue up jobs so that you do not need to wait for unrelated things and you can prioritize GPU page table update which should translate in fast GPU page table update without DMA fence.
All devices which support recoverable page faults also have a dedicated paging engine for the kernel driver which the driver already makes use of. We can also update the GPU page tables with the CPU.
Alex
Full disclosure: We are aware that we've designed ourselves into an impressive corner here, and there's lots of talks going on about untangling the dma synchronization from the memory management completely. But
I think the documenting is really important: only GPU should be using this stuff and driving notifiers this way. Complete NO for any totally-not-a-GPU things in drivers/accel for sure.
Yes for user that expect HMM they need to be asynchronous. But it is hard to revert user ptr has it was done a long time ago.
Cheers, Jérôme
Am 2020-06-19 um 3:11 p.m. schrieb Alex Deucher:
On Fri, Jun 19, 2020 at 2:09 PM Jerome Glisse jglisse@redhat.com wrote:
On Fri, Jun 19, 2020 at 02:23:08PM -0300, Jason Gunthorpe wrote:
On Fri, Jun 19, 2020 at 06:19:41PM +0200, Daniel Vetter wrote:
The madness is only that device B's mmu notifier might need to wait for fence_B so that the dma operation finishes. Which in turn has to wait for device A to finish first.
So, it sound, fundamentally you've got this graph of operations across an unknown set of drivers and the kernel cannot insert itself in dma_fence hand offs to re-validate any of the buffers involved? Buffers which by definition cannot be touched by the hardware yet.
That really is a pretty horrible place to end up..
Pinning really is right answer for this kind of work flow. I think converting pinning to notifers should not be done unless notifier invalidation is relatively bounded.
I know people like notifiers because they give a bit nicer performance in some happy cases, but this cripples all the bad cases..
If pinning doesn't work for some reason maybe we should address that?
Note that the dma fence is only true for user ptr buffer which predate any HMM work and thus were using mmu notifier already. You need the mmu notifier there because of fork and other corner cases.
For nouveau the notifier do not need to wait for anything it can update the GPU page table right away. Modulo needing to write to GPU memory using dma engine if the GPU page table is in GPU memory that is not accessible from the CPU but that's never the case for nouveau so far (but i expect it will be at one point).
So i see this as 2 different cases, the user ptr case, which does pin pages by the way, where things are synchronous. Versus the HMM cases where everything is asynchronous.
I probably need to warn AMD folks again that using HMM means that you must be able to update the GPU page table asynchronously without fence wait. The issue for AMD is that they already update their GPU page table using DMA engine. I believe this is still doable if they use a kernel only DMA engine context, where only kernel can queue up jobs so that you do not need to wait for unrelated things and you can prioritize GPU page table update which should translate in fast GPU page table update without DMA fence.
All devices which support recoverable page faults also have a dedicated paging engine for the kernel driver which the driver already makes use of. We can also update the GPU page tables with the CPU.
We have a potential problem with CPU updating page tables while the GPU is retrying on page table entries because 64 bit CPU transactions don't arrive in device memory atomically.
We are using SDMA for page table updates. This currently goes through the DRM GPU scheduler to a special SDMA queue that's used by kernel-mode only. But since it's based on the DRM GPU scheduler, we do use dma-fence to wait for completion.
Regards, Felix
Alex
Full disclosure: We are aware that we've designed ourselves into an impressive corner here, and there's lots of talks going on about untangling the dma synchronization from the memory management completely. But
I think the documenting is really important: only GPU should be using this stuff and driving notifiers this way. Complete NO for any totally-not-a-GPU things in drivers/accel for sure.
Yes for user that expect HMM they need to be asynchronous. But it is hard to revert user ptr has it was done a long time ago.
Cheers, Jérôme
On Fri, Jun 19, 2020 at 03:30:32PM -0400, Felix Kuehling wrote:
Am 2020-06-19 um 3:11 p.m. schrieb Alex Deucher:
On Fri, Jun 19, 2020 at 2:09 PM Jerome Glisse jglisse@redhat.com wrote:
On Fri, Jun 19, 2020 at 02:23:08PM -0300, Jason Gunthorpe wrote:
On Fri, Jun 19, 2020 at 06:19:41PM +0200, Daniel Vetter wrote:
The madness is only that device B's mmu notifier might need to wait for fence_B so that the dma operation finishes. Which in turn has to wait for device A to finish first.
So, it sound, fundamentally you've got this graph of operations across an unknown set of drivers and the kernel cannot insert itself in dma_fence hand offs to re-validate any of the buffers involved? Buffers which by definition cannot be touched by the hardware yet.
That really is a pretty horrible place to end up..
Pinning really is right answer for this kind of work flow. I think converting pinning to notifers should not be done unless notifier invalidation is relatively bounded.
I know people like notifiers because they give a bit nicer performance in some happy cases, but this cripples all the bad cases..
If pinning doesn't work for some reason maybe we should address that?
Note that the dma fence is only true for user ptr buffer which predate any HMM work and thus were using mmu notifier already. You need the mmu notifier there because of fork and other corner cases.
For nouveau the notifier do not need to wait for anything it can update the GPU page table right away. Modulo needing to write to GPU memory using dma engine if the GPU page table is in GPU memory that is not accessible from the CPU but that's never the case for nouveau so far (but i expect it will be at one point).
So i see this as 2 different cases, the user ptr case, which does pin pages by the way, where things are synchronous. Versus the HMM cases where everything is asynchronous.
I probably need to warn AMD folks again that using HMM means that you must be able to update the GPU page table asynchronously without fence wait. The issue for AMD is that they already update their GPU page table using DMA engine. I believe this is still doable if they use a kernel only DMA engine context, where only kernel can queue up jobs so that you do not need to wait for unrelated things and you can prioritize GPU page table update which should translate in fast GPU page table update without DMA fence.
All devices which support recoverable page faults also have a dedicated paging engine for the kernel driver which the driver already makes use of. We can also update the GPU page tables with the CPU.
We have a potential problem with CPU updating page tables while the GPU is retrying on page table entries because 64 bit CPU transactions don't arrive in device memory atomically.
We are using SDMA for page table updates. This currently goes through a the DRM GPU scheduler to a special SDMA queue that's used by kernel-mode only. But since it's based on the DRM GPU scheduler, we do use dma-fence to wait for completion.
Yeah my worry is mostly that some cross dma fence leaks into it, but it should never happen really, maybe there is a way to catch it if it does and print a warning.
So yes you can use dma fences, as long as they do not have cross-deps. Another expectation is that they complete quickly, and usually page table updates do.
Cheers, Jérôme
On Fri, Jun 19, 2020 at 03:30:32PM -0400, Felix Kuehling wrote:
We have a potential problem with CPU updating page tables while the GPU is retrying on page table entries because 64 bit CPU transactions don't arrive in device memory atomically.
Except for 32 bit platforms, atomicity is guaranteed if you use uncached writeq() to aligned addresses..
The linux driver model breaks if the writeX() stuff is not atomic.
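For illustration, updating a 64 bit GPU PTE that lives in (ioremapped) device memory then boils down to a single transaction (sketch only, the function name is invented):

#include <linux/io.h>

static void sketch_set_gpu_pte(u64 __iomem *pte_addr, u64 new_pte)
{
	/* One aligned 64 bit write, no read-modify-write and no partial
	 * stores, so a GPU retrying the faulted address sees either the old
	 * or the new entry, never a torn value. */
	writeq(new_pte, pte_addr);
}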
Jason
Two in one go:
- it is allowed to call dma_fence_wait() while holding a dma_resv_lock(). This is fundamental to how eviction works with ttm, so required.
- it is allowed to call dma_fence_wait() from memory reclaim contexts, specifically from shrinker callbacks (which i915 does), and from mmu notifier callbacks (which amdgpu does, and which i915 sometimes also does, and probably always should, but that's kinda a debate). Also for stuff like HMM we really need to be able to do this, or things get real dicey.
Consequence is that any critical path necessary to get to a dma_fence_signal for a fence must never a) call dma_resv_lock nor b) allocate memory with GFP_KERNEL. Also by implication of dma_resv_lock(), no userspace faulting allowed. That's some supremely obnoxious limitations, which is why we need to sprinkle the right annotations to all relevant paths.
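For reference, this is roughly what using the annotations added earlier in this series looks like (the my_job type and my_hw_cleanup() are invented, the dma_fence_*() calls are the real API):

#include <linux/dma-fence.h>

static void sketch_complete_job(struct my_job *job)
{
	bool cookie = dma_fence_begin_signalling();

	/* Everything between begin/end_signalling is on the critical path to
	 * dma_fence_signal(): it must not take dma_resv_lock() and must not
	 * allocate with GFP_KERNEL (GFP_ATOMIC is ok). */
	my_hw_cleanup(job);
	dma_fence_signal(job->done_fence);

	dma_fence_end_signalling(cookie);
}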
The one big locking context we're leaving out here is mmu notifiers, added in
commit 23b68395c7c78a764e8963fc15a7cfd318bf187f
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Mon Aug 26 22:14:21 2019 +0200
mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end
that one covers a lot of other callsites, and it's also allowed to wait on dma-fences from mmu notifiers. But there's no ready-made functions exposed to prime this, so I've left it out for now.
v2: Also track against mmu notifier context.
v3: kerneldoc to spec the cross-driver contract. Note that currently i915 throws in a hard-coded 10s timeout on foreign fences (not sure why that was done, but it's there), which is why that rule is worded with SHOULD instead of MUST.
Also some of the mmu_notifier/shrinker rules might surprise SoC drivers, I haven't fully audited them all. Which is infeasible anyway, we'll need to run them with lockdep and dma-fence annotations and see what goes boom.
v4: A spelling fix from Mika
v5: #ifdef for CONFIG_MMU_NOTIFIER. Reported by 0day. Unfortunately this means lockdep enforcement is slightly inconsistent, it won't spot GFP_NOIO and GFP_NOFS allocations in the wrong spot if CONFIG_MMU_NOTIFIER is disabled in the kernel config. Oh well.
Cc: kernel test robot lkp@intel.com Reviewed-by: Thomas Hellström thomas.hellstrom@intel.com (v4) Cc: Mika Kuoppala mika.kuoppala@intel.com Cc: Thomas Hellstrom thomas.hellstrom@intel.com Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Cc: linux-rdma@vger.kernel.org Cc: amd-gfx@lists.freedesktop.org Cc: intel-gfx@lists.freedesktop.org Cc: Chris Wilson chris@chris-wilson.co.uk Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com --- Documentation/driver-api/dma-buf.rst | 6 ++++ drivers/dma-buf/dma-fence.c | 41 ++++++++++++++++++++++++++++ drivers/dma-buf/dma-resv.c | 8 ++++++ include/linux/dma-fence.h | 1 + 4 files changed, 56 insertions(+)
diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst index 05d856131140..f8f6decde359 100644 --- a/Documentation/driver-api/dma-buf.rst +++ b/Documentation/driver-api/dma-buf.rst @@ -133,6 +133,12 @@ DMA Fences .. kernel-doc:: drivers/dma-buf/dma-fence.c :doc: DMA fences overview
+DMA Fence Cross-Driver Contract +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. kernel-doc:: drivers/dma-buf/dma-fence.c + :doc: fence cross-driver contract + DMA Fence Signalling Annotations ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c index 0005bc002529..754e6fb84fb7 100644 --- a/drivers/dma-buf/dma-fence.c +++ b/drivers/dma-buf/dma-fence.c @@ -64,6 +64,47 @@ static atomic64_t dma_fence_context_counter = ATOMIC64_INIT(1); * &dma_buf.resv pointer. */
+/** + * DOC: fence cross-driver contract + * + * Since &dma_fence provide a cross driver contract, all drivers must follow the + * same rules: + * + * * Fences must complete in a reasonable time. Fences which represent kernels + * and shaders submitted by userspace, which could run forever, must be backed + * up by timeout and gpu hang recovery code. Minimally that code must prevent + * further command submission and force complete all in-flight fences, e.g. + * when the driver or hardware do not support gpu reset, or if the gpu reset + * failed for some reason. Ideally the driver supports gpu recovery which only + * affects the offending userspace context, and no other userspace + * submissions. + * + * * Drivers may have different ideas of what completion within a reasonable + * time means. Some hang recovery code uses a fixed timeout, others a mix + * between observing forward progress and increasingly strict timeouts. + * Drivers should not try to second guess timeout handling of fences from + * other drivers. + * + * * To ensure there's no deadlocks of dma_fence_wait() against other locks + * drivers should annotate all code required to reach dma_fence_signal(), + * which completes the fences, with dma_fence_begin_signalling() and + * dma_fence_end_signalling(). + * + * * Drivers are allowed to call dma_fence_wait() while holding dma_resv_lock(). + * This means any code required for fence completion cannot acquire a + * &dma_resv lock. Note that this also pulls in the entire established + * locking hierarchy around dma_resv_lock() and dma_resv_unlock(). + * + * * Drivers are allowed to call dma_fence_wait() from their &shrinker + * callbacks. This means any code required for fence completion cannot + * allocate memory with GFP_KERNEL. + * + * * Drivers are allowed to call dma_fence_wait() from their &mmu_notifier + * respectively &mmu_interval_notifier callbacks. This means any code required + * for fence completeion cannot allocate memory with GFP_NOFS or GFP_NOIO. + * Only GFP_ATOMIC is permissible, which might fail. + */ + static const char *dma_fence_stub_get_name(struct dma_fence *fence) { return "stub"; diff --git a/drivers/dma-buf/dma-resv.c b/drivers/dma-buf/dma-resv.c index 99c0a33c918d..51f0583ead19 100644 --- a/drivers/dma-buf/dma-resv.c +++ b/drivers/dma-buf/dma-resv.c @@ -35,6 +35,7 @@ #include <linux/dma-resv.h> #include <linux/export.h> #include <linux/sched/mm.h> +#include <linux/mmu_notifier.h>
/** * DOC: Reservation Object Overview @@ -115,6 +116,13 @@ static int __init dma_resv_lockdep(void) if (ret == -EDEADLK) dma_resv_lock_slow(&obj, &ctx); fs_reclaim_acquire(GFP_KERNEL); +#ifdef CONFIG_MMU_NOTIFIER + lock_map_acquire(&__mmu_notifier_invalidate_range_start_map); + __dma_fence_might_wait(); + lock_map_release(&__mmu_notifier_invalidate_range_start_map); +#else + __dma_fence_might_wait(); +#endif fs_reclaim_release(GFP_KERNEL); ww_mutex_unlock(&obj.lock); ww_acquire_fini(&ctx); diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h index 3f288f7db2ef..09e23adb351d 100644 --- a/include/linux/dma-fence.h +++ b/include/linux/dma-fence.h @@ -360,6 +360,7 @@ dma_fence_get_rcu_safe(struct dma_fence __rcu **fencep) #ifdef CONFIG_LOCKDEP bool dma_fence_begin_signalling(void); void dma_fence_end_signalling(bool cookie); +void __dma_fence_might_wait(void); #else static inline bool dma_fence_begin_signalling(void) {
This is needed to signal the fences from page flips, so annotate it accordingly. We need to annotate the entire timer callback since if we get stuck anywhere in there, then the timer stops, and hence the fences stop. Just annotating the top part that does the vblank handling isn't enough.
Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Cc: linux-rdma@vger.kernel.org Cc: amd-gfx@lists.freedesktop.org Cc: intel-gfx@lists.freedesktop.org Cc: Chris Wilson chris@chris-wilson.co.uk Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com Cc: Rodrigo Siqueira rodrigosiqueiramelo@gmail.com Cc: Haneen Mohammed hamohammed.sa@gmail.com Cc: Daniel Vetter daniel@ffwll.ch --- drivers/gpu/drm/vkms/vkms_crtc.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/vkms/vkms_crtc.c b/drivers/gpu/drm/vkms/vkms_crtc.c index ac85e17428f8..a53a40848a72 100644 --- a/drivers/gpu/drm/vkms/vkms_crtc.c +++ b/drivers/gpu/drm/vkms/vkms_crtc.c @@ -1,5 +1,7 @@ // SPDX-License-Identifier: GPL-2.0+
+#include <linux/dma-fence.h> + #include <drm/drm_atomic.h> #include <drm/drm_atomic_helper.h> #include <drm/drm_probe_helper.h> @@ -14,7 +16,9 @@ static enum hrtimer_restart vkms_vblank_simulate(struct hrtimer *timer) struct drm_crtc *crtc = &output->crtc; struct vkms_crtc_state *state; u64 ret_overrun; - bool ret; + bool ret, fence_cookie; + + fence_cookie = dma_fence_begin_signalling();
ret_overrun = hrtimer_forward_now(&output->vblank_hrtimer, output->period_ns); @@ -49,6 +53,8 @@ static enum hrtimer_restart vkms_vblank_simulate(struct hrtimer *timer) DRM_DEBUG_DRIVER("Composer worker already queued\n"); }
+ dma_fence_end_signalling(fence_cookie); + return HRTIMER_RESTART; }
This is rather overkill since currently all drivers call this from hardirq (or at least timers). But maybe in the future we're going to have threaded irq handlers and whatnot, so it doesn't hurt to be prepared. Plus this is an easy start for sprinkling these fence annotations into shared code.
Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Cc: linux-rdma@vger.kernel.org Cc: amd-gfx@lists.freedesktop.org Cc: intel-gfx@lists.freedesktop.org Cc: Chris Wilson chris@chris-wilson.co.uk Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com --- drivers/gpu/drm/drm_vblank.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/drm_vblank.c b/drivers/gpu/drm/drm_vblank.c index 85e5f2db1608..93a5bba5f665 100644 --- a/drivers/gpu/drm/drm_vblank.c +++ b/drivers/gpu/drm/drm_vblank.c @@ -24,6 +24,7 @@ * OTHER DEALINGS IN THE SOFTWARE. */
+#include <linux/dma-fence.h> #include <linux/export.h> #include <linux/moduleparam.h>
@@ -1908,7 +1909,7 @@ bool drm_handle_vblank(struct drm_device *dev, unsigned int pipe) { struct drm_vblank_crtc *vblank = &dev->vblank[pipe]; unsigned long irqflags; - bool disable_irq; + bool disable_irq, fence_cookie;
if (drm_WARN_ON_ONCE(dev, !drm_dev_has_vblank(dev))) return false; @@ -1916,6 +1917,8 @@ bool drm_handle_vblank(struct drm_device *dev, unsigned int pipe) if (drm_WARN_ON(dev, pipe >= dev->num_crtcs)) return false;
+ fence_cookie = dma_fence_begin_signalling(); + spin_lock_irqsave(&dev->event_lock, irqflags);
/* Need timestamp lock to prevent concurrent execution with @@ -1928,6 +1931,7 @@ bool drm_handle_vblank(struct drm_device *dev, unsigned int pipe) if (!vblank->enabled) { spin_unlock(&dev->vblank_time_lock); spin_unlock_irqrestore(&dev->event_lock, irqflags); + dma_fence_end_signalling(fence_cookie); return false; }
@@ -1953,6 +1957,8 @@ bool drm_handle_vblank(struct drm_device *dev, unsigned int pipe) if (disable_irq) vblank_disable_fn(&vblank->disable_timer);
+ dma_fence_end_signalling(fence_cookie); + return true; } EXPORT_SYMBOL(drm_handle_vblank);
This is a bit disappointing since we need to split the annotations over all the different parts.
I was considering just leaking the critical section into the ->atomic_commit_tail callback of each driver. But that would mean we'd need to pass the fence_cookie into each driver (there's a total of 13 implementations of this hook right now), so a bad flag day. And also a bit of a leaky abstraction.
Hence just do it function-by-function.
Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Cc: linux-rdma@vger.kernel.org Cc: amd-gfx@lists.freedesktop.org Cc: intel-gfx@lists.freedesktop.org Cc: Chris Wilson chris@chris-wilson.co.uk Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com --- drivers/gpu/drm/drm_atomic_helper.c | 16 ++++++++++++++++ 1 file changed, 16 insertions(+)
diff --git a/drivers/gpu/drm/drm_atomic_helper.c b/drivers/gpu/drm/drm_atomic_helper.c index 7cd7fe0d57b4..bfcc7857a9a1 100644 --- a/drivers/gpu/drm/drm_atomic_helper.c +++ b/drivers/gpu/drm/drm_atomic_helper.c @@ -1549,6 +1549,7 @@ EXPORT_SYMBOL(drm_atomic_helper_wait_for_flip_done); void drm_atomic_helper_commit_tail(struct drm_atomic_state *old_state) { struct drm_device *dev = old_state->dev; + bool fence_cookie = dma_fence_begin_signalling();
drm_atomic_helper_commit_modeset_disables(dev, old_state);
@@ -1560,6 +1561,8 @@ void drm_atomic_helper_commit_tail(struct drm_atomic_state *old_state)
drm_atomic_helper_commit_hw_done(old_state);
+ dma_fence_end_signalling(fence_cookie); + drm_atomic_helper_wait_for_vblanks(dev, old_state);
drm_atomic_helper_cleanup_planes(dev, old_state); @@ -1579,6 +1582,7 @@ EXPORT_SYMBOL(drm_atomic_helper_commit_tail); void drm_atomic_helper_commit_tail_rpm(struct drm_atomic_state *old_state) { struct drm_device *dev = old_state->dev; + bool fence_cookie = dma_fence_begin_signalling();
drm_atomic_helper_commit_modeset_disables(dev, old_state);
@@ -1591,6 +1595,8 @@ void drm_atomic_helper_commit_tail_rpm(struct drm_atomic_state *old_state)
drm_atomic_helper_commit_hw_done(old_state);
+ dma_fence_end_signalling(fence_cookie); + drm_atomic_helper_wait_for_vblanks(dev, old_state);
drm_atomic_helper_cleanup_planes(dev, old_state); @@ -1606,6 +1612,9 @@ static void commit_tail(struct drm_atomic_state *old_state) ktime_t start; s64 commit_time_ms; unsigned int i, new_self_refresh_mask = 0; + bool fence_cookie; + + fence_cookie = dma_fence_begin_signalling();
funcs = dev->mode_config.helper_private;
@@ -1634,6 +1643,8 @@ static void commit_tail(struct drm_atomic_state *old_state) if (new_crtc_state->self_refresh_active) new_self_refresh_mask |= BIT(i);
+ dma_fence_end_signalling(fence_cookie); + if (funcs && funcs->atomic_commit_tail) funcs->atomic_commit_tail(old_state); else @@ -1789,6 +1800,7 @@ int drm_atomic_helper_commit(struct drm_device *dev, bool nonblock) { int ret; + bool fence_cookie;
if (state->async_update) { ret = drm_atomic_helper_prepare_planes(dev, state); @@ -1811,6 +1823,8 @@ int drm_atomic_helper_commit(struct drm_device *dev, if (ret) return ret;
+ fence_cookie = dma_fence_begin_signalling(); + if (!nonblock) { ret = drm_atomic_helper_wait_for_fences(dev, state, true); if (ret) @@ -1848,6 +1862,7 @@ int drm_atomic_helper_commit(struct drm_device *dev, */
drm_atomic_state_get(state); + dma_fence_end_signalling(fence_cookie); if (nonblock) queue_work(system_unbound_wq, &state->commit_work); else @@ -1856,6 +1871,7 @@ int drm_atomic_helper_commit(struct drm_device *dev, return 0;
err: + dma_fence_end_signalling(fence_cookie); drm_atomic_helper_cleanup_planes(dev, state); return ret; }
I need a canary in a ttm-based atomic driver to make sure the dma_fence_begin/end_signalling annotations actually work.
Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Cc: linux-rdma@vger.kernel.org Cc: amd-gfx@lists.freedesktop.org Cc: intel-gfx@lists.freedesktop.org Cc: Chris Wilson chris@chris-wilson.co.uk Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com --- drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 6 ++++++ 1 file changed, 6 insertions(+)
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c index bdba0bfd6df1..adabfa929f42 100644 --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c @@ -57,6 +57,7 @@
#include "ivsrcid/ivsrcid_vislands30.h"
+#include <linux/module.h> #include <linux/module.h> #include <linux/moduleparam.h> #include <linux/version.h> @@ -7320,6 +7321,9 @@ static void amdgpu_dm_atomic_commit_tail(struct drm_atomic_state *state) struct drm_connector_state *old_con_state, *new_con_state; struct dm_crtc_state *dm_old_crtc_state, *dm_new_crtc_state; int crtc_disable_count = 0; + bool fence_cookie; + + fence_cookie = dma_fence_begin_signalling();
drm_atomic_helper_update_legacy_modeset_state(dev, state);
@@ -7600,6 +7604,8 @@ static void amdgpu_dm_atomic_commit_tail(struct drm_atomic_state *state) /* Signal HW programming completion */ drm_atomic_helper_commit_hw_done(state);
+ dma_fence_end_signalling(fence_cookie); + if (wait_for_vblank) drm_atomic_helper_wait_for_flip_done(dev, state);
Hi Roland & vmwgfx maintainers,
Thomas has played around with these annotations on his vmwgfx setup, and found some issues. Apparently in the atomic_commit_tail path when handling the dirty rectangle stuff you acquire a ttm reservation, which is a no-go since it could deadlock with other paths - atomic commits can produce a dma_fence.
This patch here highlights that with the new annotations, and apparently causes a lockdep splat if you go through the dirty rect paths (not sure if it also happens otherwise, Thomas can fill you in with the details).
Can you pls take a look at this? I'm happy to help out with analyzing any lockdep splats. For actual fixes Thomas is better since I don't understand a lot of how drm/vmwgfx works internally.
Cheers, Daniel
On Thu, Jun 4, 2020 at 10:12 AM Daniel Vetter daniel.vetter@ffwll.ch wrote:
I need a canary in a ttm-based atomic driver to make sure the dma_fence_begin/end_signalling annotations actually work.
Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Cc: linux-rdma@vger.kernel.org Cc: amd-gfx@lists.freedesktop.org Cc: intel-gfx@lists.freedesktop.org Cc: Chris Wilson chris@chris-wilson.co.uk Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com
drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 6 ++++++ 1 file changed, 6 insertions(+)
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c index bdba0bfd6df1..adabfa929f42 100644 --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c @@ -57,6 +57,7 @@
#include "ivsrcid/ivsrcid_vislands30.h"
+#include <linux/module.h> #include <linux/module.h> #include <linux/moduleparam.h> #include <linux/version.h> @@ -7320,6 +7321,9 @@ static void amdgpu_dm_atomic_commit_tail(struct drm_atomic_state *state) struct drm_connector_state *old_con_state, *new_con_state; struct dm_crtc_state *dm_old_crtc_state, *dm_new_crtc_state; int crtc_disable_count = 0;
bool fence_cookie;
fence_cookie = dma_fence_begin_signalling(); drm_atomic_helper_update_legacy_modeset_state(dev, state);
@@ -7600,6 +7604,8 @@ static void amdgpu_dm_atomic_commit_tail(struct drm_atomic_state *state) /* Signal HW programming completion */ drm_atomic_helper_commit_hw_done(state);
dma_fence_end_signalling(fence_cookie);
if (wait_for_vblank) drm_atomic_helper_wait_for_flip_done(dev, state);
-- 2.26.2
If the scheduler rt thread gets stuck on a mutex that we're holding while waiting for gpu workloads to complete, we have a problem.
Add dma-fence annotations so that lockdep can check this for us.
I've tried to quite carefully review this, and I think it's at the right spot. But obviously I'm no expert on the drm scheduler.
Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Cc: linux-rdma@vger.kernel.org Cc: amd-gfx@lists.freedesktop.org Cc: intel-gfx@lists.freedesktop.org Cc: Chris Wilson chris@chris-wilson.co.uk Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com --- drivers/gpu/drm/scheduler/sched_main.c | 6 ++++++ 1 file changed, 6 insertions(+)
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c index 2f319102ae9f..06a736e506ad 100644 --- a/drivers/gpu/drm/scheduler/sched_main.c +++ b/drivers/gpu/drm/scheduler/sched_main.c @@ -763,9 +763,12 @@ static int drm_sched_main(void *param) struct sched_param sparam = {.sched_priority = 1}; struct drm_gpu_scheduler *sched = (struct drm_gpu_scheduler *)param; int r; + bool fence_cookie;
sched_setscheduler(current, SCHED_FIFO, &sparam);
+ fence_cookie = dma_fence_begin_signalling(); + while (!kthread_should_stop()) { struct drm_sched_entity *entity = NULL; struct drm_sched_fence *s_fence; @@ -823,6 +826,9 @@ static int drm_sched_main(void *param)
wake_up(&sched->job_scheduled); } + + dma_fence_end_signalling(fence_cookie); + return 0; }
This is a bit tricky: since ->notifier_lock is held while calling dma_fence_wait() we must ensure that also the read side (i.e. dma_fence_begin_signalling()) is on the same side. If we mix this up, lockdep complains, and that's again why we want to have these annotations.
A nice side effect of this is that because of the fs_reclaim priming for dma_fence_enable lockdep now automatically checks for us that nothing in here allocates memory, without even running any userptr workloads.
Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Cc: linux-rdma@vger.kernel.org Cc: amd-gfx@lists.freedesktop.org Cc: intel-gfx@lists.freedesktop.org Cc: Chris Wilson chris@chris-wilson.co.uk Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com --- drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 5 +++++ 1 file changed, 5 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c index a25fb59c127c..e109666aec14 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c @@ -1212,6 +1212,7 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p, struct amdgpu_job *job; uint64_t seq; int r; + bool fence_cookie;
job = p->job; p->job = NULL; @@ -1226,6 +1227,8 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p, */ mutex_lock(&p->adev->notifier_lock);
+ fence_cookie = dma_fence_begin_signalling(); + /* If userptr are invalidated after amdgpu_cs_parser_bos(), return * -EAGAIN, drmIoctl in libdrm will restart the amdgpu_cs_ioctl. */ @@ -1262,12 +1265,14 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p, amdgpu_vm_move_to_lru_tail(p->adev, &fpriv->vm);
ttm_eu_fence_buffer_objects(&p->ticket, &p->validated, p->fence); + dma_fence_end_signalling(fence_cookie); mutex_unlock(&p->adev->notifier_lock);
return 0;
error_abort: drm_sched_job_cleanup(&job->base); + dma_fence_end_signalling(fence_cookie); mutex_unlock(&p->adev->notifier_lock);
error_unlock:
My dma-fence lockdep annotations caught an inversion because we allocate memory where we really shouldn't:
kmem_cache_alloc+0x2b/0x6d0 amdgpu_fence_emit+0x30/0x330 [amdgpu] amdgpu_ib_schedule+0x306/0x550 [amdgpu] amdgpu_job_run+0x10f/0x260 [amdgpu] drm_sched_main+0x1b9/0x490 [gpu_sched] kthread+0x12e/0x150
Trouble right now is that lockdep only validates against GFP_FS, which would be good enough for shrinkers. But for mmu_notifiers we actually need to rule out reclaim entirely, i.e. only GFP_ATOMIC is safe, since they can be called from any page laundering, even if GFP_NOFS or GFP_NOIO are set.
I guess we should improve the lockdep annotations for fs_reclaim_acquire/release.
Ofc the real fix is to properly preallocate this fence and stuff it into the amdgpu job structure. But GFP_ATOMIC gets the lockdep splat out of the way.
v2: Two more allocations in scheduler paths.
First one:
__kmalloc+0x58/0x720 amdgpu_vmid_grab+0x100/0xca0 [amdgpu] amdgpu_job_dependency+0xf9/0x120 [amdgpu] drm_sched_entity_pop_job+0x3f/0x440 [gpu_sched] drm_sched_main+0xf9/0x490 [gpu_sched]
Second one:
kmem_cache_alloc+0x2b/0x6d0 amdgpu_sync_fence+0x7e/0x110 [amdgpu] amdgpu_vmid_grab+0x86b/0xca0 [amdgpu] amdgpu_job_dependency+0xf9/0x120 [amdgpu] drm_sched_entity_pop_job+0x3f/0x440 [gpu_sched] drm_sched_main+0xf9/0x490 [gpu_sched]
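For reference, a rough sketch of the preallocation approach mentioned above as the real fix; everything here (example_job, example_fence_ops, ...) is made up for illustration and is not the actual amdgpu change:

#include <linux/dma-fence.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(example_fence_lock);
static const struct dma_fence_ops example_fence_ops; /* callbacks elided */
static u64 example_fence_context; /* from dma_fence_context_alloc() */

struct example_job {
	struct dma_fence *fence; /* allocated up front, initialized late */
};

static int example_job_init(struct example_job *job)
{
	/* ioctl/process context: blocking GFP_KERNEL is still fine here */
	job->fence = kzalloc(sizeof(*job->fence), GFP_KERNEL);
	return job->fence ? 0 : -ENOMEM;
}

static struct dma_fence *example_job_run(struct example_job *job, u64 seqno)
{
	/*
	 * Scheduler thread, inside the fence-signalling critical section:
	 * no allocations, just consume what was prepared earlier.
	 */
	dma_fence_init(job->fence, &example_fence_ops, &example_fence_lock,
		       example_fence_context, seqno);
	return job->fence;
}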
Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Cc: linux-rdma@vger.kernel.org Cc: amd-gfx@lists.freedesktop.org Cc: intel-gfx@lists.freedesktop.org Cc: Chris Wilson chris@chris-wilson.co.uk Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com --- drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c index d878fe7fee51..055b47241bb1 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c @@ -143,7 +143,7 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, struct dma_fence **f, uint32_t seq; int r;
- fence = kmem_cache_alloc(amdgpu_fence_slab, GFP_KERNEL); + fence = kmem_cache_alloc(amdgpu_fence_slab, GFP_ATOMIC); if (fence == NULL) return -ENOMEM;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c index fe92dcd94d4a..fdcd6659f5ad 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c @@ -208,7 +208,7 @@ static int amdgpu_vmid_grab_idle(struct amdgpu_vm *vm, if (ring->vmid_wait && !dma_fence_is_signaled(ring->vmid_wait)) return amdgpu_sync_fence(sync, ring->vmid_wait, false);
- fences = kmalloc_array(sizeof(void *), id_mgr->num_ids, GFP_KERNEL); + fences = kmalloc_array(sizeof(void *), id_mgr->num_ids, GFP_ATOMIC); if (!fences) return -ENOMEM;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c index b87ca171986a..330476cc0c86 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c @@ -168,7 +168,7 @@ int amdgpu_sync_fence(struct amdgpu_sync *sync, struct dma_fence *f, if (amdgpu_sync_add_later(sync, f, explicit)) return 0;
- e = kmem_cache_alloc(amdgpu_sync_slab, GFP_KERNEL); + e = kmem_cache_alloc(amdgpu_sync_slab, GFP_ATOMIC); if (!e) return -ENOMEM;
Not going to bother with a complete & pretty commit message, just the offending backtrace:
kvmalloc_node+0x47/0x80 dc_create_state+0x1f/0x60 [amdgpu] dc_commit_state+0xcb/0x9b0 [amdgpu] amdgpu_dm_atomic_commit_tail+0xd31/0x2010 [amdgpu] commit_tail+0xa4/0x140 [drm_kms_helper] drm_atomic_helper_commit+0x152/0x180 [drm_kms_helper] drm_client_modeset_commit_atomic+0x1ea/0x250 [drm] drm_client_modeset_commit_locked+0x55/0x190 [drm] drm_client_modeset_commit+0x24/0x40 [drm]
v2: Found more in DC code, I'm just going to pile them all up.
Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Cc: linux-rdma@vger.kernel.org Cc: amd-gfx@lists.freedesktop.org Cc: intel-gfx@lists.freedesktop.org Cc: Chris Wilson chris@chris-wilson.co.uk Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com --- drivers/gpu/drm/amd/amdgpu/atom.c | 2 +- drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 2 +- drivers/gpu/drm/amd/display/dc/core/dc.c | 4 +++- 3 files changed, 5 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/atom.c b/drivers/gpu/drm/amd/amdgpu/atom.c index 4cfc786699c7..1b0c674fab25 100644 --- a/drivers/gpu/drm/amd/amdgpu/atom.c +++ b/drivers/gpu/drm/amd/amdgpu/atom.c @@ -1226,7 +1226,7 @@ static int amdgpu_atom_execute_table_locked(struct atom_context *ctx, int index, ectx.abort = false; ectx.last_jump = 0; if (ws) - ectx.ws = kcalloc(4, ws, GFP_KERNEL); + ectx.ws = kcalloc(4, ws, GFP_ATOMIC); else ectx.ws = NULL;
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c index adabfa929f42..c575e7394d03 100644 --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c @@ -6833,7 +6833,7 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state, struct dc_stream_update stream_update; } *bundle;
- bundle = kzalloc(sizeof(*bundle), GFP_KERNEL); + bundle = kzalloc(sizeof(*bundle), GFP_ATOMIC);
if (!bundle) { dm_error("Failed to allocate update bundle\n"); diff --git a/drivers/gpu/drm/amd/display/dc/core/dc.c b/drivers/gpu/drm/amd/display/dc/core/dc.c index 45cfb7c45566..9a8e321a7a15 100644 --- a/drivers/gpu/drm/amd/display/dc/core/dc.c +++ b/drivers/gpu/drm/amd/display/dc/core/dc.c @@ -1416,8 +1416,10 @@ bool dc_post_update_surfaces_to_stream(struct dc *dc)
struct dc_state *dc_create_state(struct dc *dc) { + /* No you really cant allocate random crap here this late in + * atomic_commit_tail. */ struct dc_state *context = kvzalloc(sizeof(struct dc_state), - GFP_KERNEL); + GFP_ATOMIC);
if (!context) return NULL;
Trying to grab dma_resv_lock while in commit_tail before we've done all the code that leads to the eventual signalling of the vblank event (which can be a dma_fence) is deadlock-y. Don't do that.
Here the solution is easy because just grabbing locks to read something races anyway. We don't need to bother, READ_ONCE is equivalent. And avoids the locking issue.
Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Cc: linux-rdma@vger.kernel.org Cc: amd-gfx@lists.freedesktop.org Cc: intel-gfx@lists.freedesktop.org Cc: Chris Wilson chris@chris-wilson.co.uk Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com --- drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 10 ++++++++++ 1 file changed, 10 insertions(+)
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c index c575e7394d03..04c11443b9ca 100644 --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c @@ -6910,7 +6910,11 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state, * explicitly on fences instead * and in general should be called for * blocking commit to as per framework helpers + * + * Yes, this deadlocks, since you're calling dma_resv_lock in a + * path that leads to a dma_fence_signal(). Don't do that. */ +#if 0 r = amdgpu_bo_reserve(abo, true); if (unlikely(r != 0)) DRM_ERROR("failed to reserve buffer before flip\n"); @@ -6920,6 +6924,12 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state, tmz_surface = amdgpu_bo_encrypted(abo);
amdgpu_bo_unreserve(abo); +#endif + /* + * this races anyway, so READ_ONCE isn't any better or worse + * than the stuff above. Except the stuff above can deadlock. + */ + tiling_flags = READ_ONCE(abo->tiling_flags);
fill_dc_plane_info_and_addr( dm->adev, new_plane_state, tiling_flags,
Hi Daniel,
On 04/06/2020 10:12, Daniel Vetter wrote: [...]
@@ -6910,7 +6910,11 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state, * explicitly on fences instead * and in general should be called for * blocking commit to as per framework helpers
*
* Yes, this deadlocks, since you're calling dma_resv_lock in a
* path that leads to a dma_fence_signal(). Don't do that. */
+#if 0 r = amdgpu_bo_reserve(abo, true); if (unlikely(r != 0)) DRM_ERROR("failed to reserve buffer before flip\n"); @@ -6920,6 +6924,12 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state, tmz_surface = amdgpu_bo_encrypted(abo);
amdgpu_bo_unreserve(abo);
+#endif
/*
* this races anyway, so READ_ONCE isn't any better or worse
* than the stuff above. Except the stuff above can deadlock.
*/
tiling_flags = READ_ONCE(abo->tiling_flags);
With this change "tmz_surface" won't be initialized properly. Adding the following line should fix it:
tmz_surface = READ_ONCE(abo->flags) & AMDGPU_GEM_CREATE_ENCRYPTED;
Pierre-Eric
fill_dc_plane_info_and_addr( dm->adev, new_plane_state, tiling_flags,
On Fri, Jun 5, 2020 at 10:30 AM Pierre-Eric Pelloux-Prayer pierre-eric.pelloux-prayer@amd.com wrote:
Hi Daniel,
On 04/06/2020 10:12, Daniel Vetter wrote: [...]
@@ -6910,7 +6910,11 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state, * explicitly on fences instead * and in general should be called for * blocking commit to as per framework helpers
*
* Yes, this deadlocks, since you're calling dma_resv_lock in a
* path that leads to a dma_fence_signal(). Don't do that. */
+#if 0 r = amdgpu_bo_reserve(abo, true); if (unlikely(r != 0)) DRM_ERROR("failed to reserve buffer before flip\n"); @@ -6920,6 +6924,12 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state, tmz_surface = amdgpu_bo_encrypted(abo);
amdgpu_bo_unreserve(abo);
+#endif
/*
* this races anyway, so READ_ONCE isn't any better or worse
* than the stuff above. Except the stuff above can deadlock.
*/
tiling_flags = READ_ONCE(abo->tiling_flags);
With this change "tmz_surface" won't be initialized properly. Adding the following line should fix it:
tmz_surface = READ_ONCE(abo->flags) & AMDGPU_GEM_CREATE_ENCRYPTED;
So to make this clear, I'm not really proposing to fix up all the drivers in detail. There are a lot more bugs in all the other drivers, I'm pretty sure. The driver fixups really are just quick hacks to illustrate the problem, and at least in some cases, maybe illustrate a possible solution.
For the real fixes I think this needs driver teams working on this, and make sure it's all solid. I can help a bit with review (especially for placing the annotations, e.g. the one I put in cs_submit() annotates a bit too much), but that's it.
Also I think the patch is from before tmz landed, and I just blindly rebased over it :-) -Daniel
Pierre-Eric
fill_dc_plane_info_and_addr( dm->adev, new_plane_state, tiling_flags,
In the face of unprivileged userspace being able to submit bogus gpu workloads, the kernel needs gpu timeout and reset (tdr) to guarantee that dma_fences actually complete. Annotate this worker to make sure we don't have any accidental locking inversions or other problems lurking.
Originally this was part of the overall scheduler annotation patch. But amdgpu has some glorious inversions here:
- grabs console_lock

- does a full modeset, which grabs all kinds of locks (drm_modeset_lock, dma_resv_lock) which can deadlock with dma_fence_wait held inside them.

- almost minor at that point, but the modeset code also allocates memory
These all look like they'll be very hard to fix properly, the hardware seems to require a full display reset with any gpu recovery.
Hence split out as a separate patch.
Since amdgpu isn't the only hardware driver that needs to reset the display (at least gen2/3 on intel have the same problem) we need a generic solution for this. There are two tricks we could steal from drm/i915 and lift to dma-fence:
- The big whack, aka force-complete all fences. i915 does this for all pending jobs if the reset is somehow stuck. Trouble is we'd need to do this for all fences in the entire system, and just the book-keeping for that will be fun. Plus lots of drivers use fences for all kinds of internal stuff like memory management, so unconditionally resetting all of them doesn't work.
I'm also hoping that with these fence annotations we could enlist lockdep in finding the last offenders causing deadlocks, and we could remove this get-out-of-jail trick.
- The more feasible approach (across drivers at least as part of the dma_fence contract) is what drm/i915 does for gen2/3: When we need to reset the display we wake up all dma_fence_wait_interruptible calls, or well at least the equivalent of those in i915 internally.
Relying on ioctl restart we force all other threads to release their locks, which means the tdr thread is guaranteed to be able to get them. I think we could implement this at the dma_fence level, including proper lockdep annotations.
dma_fence_begin_tdr():

- must be nested within a dma_fence_begin/end_signalling section

- will wake up all interruptible (but not the non-interruptible) dma_fence_wait() calls and force them to complete with a -ERESTARTSYS errno code. All new interruptible calls to dma_fence_wait() will immediately fail with the same error code.
dma_fence_end_tdr():

- this will convert dma_fence_wait() calls back to normal.
Of course interrupting dma_fence_wait is only ok if the caller specified that, which means we need to split the annotations into interruptible and non-interruptible version. If we then make sure that we only use interruptible dma_fence_wait() calls while holding drm_modeset_lock we can grab them in tdr code, and allow display resets. Doing the same for dma_resv_lock might be a lot harder, so buffer updates must be avoided.
What's worse, we're not going to be able to make the dma_fence_wait() calls in mmu-notifiers interruptible, that doesn't work. So allocating memory still won't be allowed, even in tdr sections. Plus obviously we can use this trick only in tdr, it is rather intrusive.
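To make the proposal a bit more tangible, a hypothetical usage sketch; dma_fence_begin_tdr()/dma_fence_end_tdr() do not exist yet, and the cookie-style signature plus the example_* names are pure assumptions modeled on the signalling annotations:

/*
 * Hypothetical only: how a driver's timeout handler might use the
 * proposed interface.
 */
static void example_timedout_job(struct example_device *edev)
{
	bool signalling_cookie, tdr_cookie;

	/* tdr itself has to sit inside a fence-signalling section */
	signalling_cookie = dma_fence_begin_signalling();

	/* kick all interruptible dma_fence_wait() out with -ERESTARTSYS */
	tdr_cookie = dma_fence_begin_tdr();

	/*
	 * Interruptible waiters have dropped drm_modeset_lock via ioctl
	 * restart, so the display can now be reset as part of recovery.
	 */
	example_reset_display(edev);

	dma_fence_end_tdr(tdr_cookie);
	dma_fence_end_signalling(signalling_cookie);
}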
Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Cc: linux-rdma@vger.kernel.org Cc: amd-gfx@lists.freedesktop.org Cc: intel-gfx@lists.freedesktop.org Cc: Chris Wilson chris@chris-wilson.co.uk Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com --- drivers/gpu/drm/scheduler/sched_main.c | 5 +++++ 1 file changed, 5 insertions(+)
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c index 06a736e506ad..e34a44376e87 100644 --- a/drivers/gpu/drm/scheduler/sched_main.c +++ b/drivers/gpu/drm/scheduler/sched_main.c @@ -279,9 +279,12 @@ static void drm_sched_job_timedout(struct work_struct *work) { struct drm_gpu_scheduler *sched; struct drm_sched_job *job; + bool fence_cookie;
sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
+ fence_cookie = dma_fence_begin_signalling(); + /* Protects against concurrent deletion in drm_sched_get_cleanup_job */ spin_lock(&sched->job_list_lock); job = list_first_entry_or_null(&sched->ring_mirror_list, @@ -313,6 +316,8 @@ static void drm_sched_job_timedout(struct work_struct *work) spin_lock(&sched->job_list_lock); drm_sched_start_timeout(sched); spin_unlock(&sched->job_list_lock); + + dma_fence_end_signalling(fence_cookie); }
/**
To improve coverage also annotate the gpu reset code itself, since that's called from other places than drm/scheduler (which is already annotated). Annotations nest, so this doesn't break anything, and it allows easier testing.
Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Cc: linux-rdma@vger.kernel.org Cc: amd-gfx@lists.freedesktop.org Cc: intel-gfx@lists.freedesktop.org Cc: Chris Wilson chris@chris-wilson.co.uk Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@ffwll.ch --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index a027a8f7b281..ac0286a5f2fc 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -4215,6 +4215,9 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev, (amdgpu_asic_reset_method(adev) == AMD_RESET_METHOD_BACO) ? true : false; bool audio_suspended = false; + bool fence_cookie; + + fence_cookie = dma_fence_begin_signalling();
/* * Flush RAM to disk so that after reboot @@ -4243,6 +4246,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev, DRM_INFO("Bailing on TDR for s_job:%llx, hive: %llx as another already in progress", job ? job->base.id : -1, hive->hive_id); mutex_unlock(&hive->hive_lock); + dma_fence_end_signalling(fence_cookie); return 0; }
@@ -4253,8 +4257,10 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev, */ INIT_LIST_HEAD(&device_list); if (adev->gmc.xgmi.num_physical_nodes > 1) { - if (!hive) + if (!hive) { + dma_fence_end_signalling(fence_cookie); return -ENODEV; + } if (!list_is_first(&adev->gmc.xgmi.head, &hive->device_list)) list_rotate_to_front(&adev->gmc.xgmi.head, &hive->device_list); device_list_handle = &hive->device_list; @@ -4269,6 +4275,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev, DRM_INFO("Bailing on TDR for s_job:%llx, as another already in progress", job ? job->base.id : -1); mutex_unlock(&hive->hive_lock); + dma_fence_end_signalling(fence_cookie); return 0; }
@@ -4409,6 +4416,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
if (r) dev_info(adev->dev, "GPU reset end with ret = %d\n", r); + dma_fence_end_signalling(fence_cookie); return r; }
This is one from the department of "maybe play lottery if you hit this, karma compensation might work". Or at least lockdep ftw!
This reverts commit 565d1941557756a584ac357d945bc374d5fcd1d0.
It's not quite as low-risk as the commit message claims, because this grabs console_lock, which might be held elsewhere while allocating memory, and that allocation might never complete because its dma_fence_wait() is stuck waiting on our gpu reset:
[ 136.763714] ====================================================== [ 136.763714] WARNING: possible circular locking dependency detected [ 136.763715] 5.7.0-rc3+ #346 Tainted: G W [ 136.763716] ------------------------------------------------------ [ 136.763716] kworker/2:3/682 is trying to acquire lock: [ 136.763716] ffffffff8226f140 (console_lock){+.+.}-{0:0}, at: drm_fb_helper_set_suspend_unlocked+0x7b/0xa0 [drm_kms_helper] [ 136.763723] but task is already holding lock: [ 136.763724] ffffffff82318c80 (dma_fence_map){++++}-{0:0}, at: drm_sched_job_timedout+0x25/0xf0 [gpu_sched] [ 136.763726] which lock already depends on the new lock.
[ 136.763726] the existing dependency chain (in reverse order) is: [ 136.763727] -> #2 (dma_fence_map){++++}-{0:0}: [ 136.763730] __dma_fence_might_wait+0x41/0xb0 [ 136.763732] dma_resv_lockdep+0x171/0x202 [ 136.763734] do_one_initcall+0x5d/0x2f0 [ 136.763736] kernel_init_freeable+0x20d/0x26d [ 136.763738] kernel_init+0xa/0xfb [ 136.763740] ret_from_fork+0x27/0x50 [ 136.763740] -> #1 (fs_reclaim){+.+.}-{0:0}: [ 136.763743] fs_reclaim_acquire.part.0+0x25/0x30 [ 136.763745] kmem_cache_alloc_trace+0x2e/0x6e0 [ 136.763747] device_create_groups_vargs+0x52/0xf0 [ 136.763747] device_create+0x49/0x60 [ 136.763749] fb_console_init+0x25/0x145 [ 136.763750] fbmem_init+0xcc/0xe2 [ 136.763750] do_one_initcall+0x5d/0x2f0 [ 136.763751] kernel_init_freeable+0x20d/0x26d [ 136.763752] kernel_init+0xa/0xfb [ 136.763753] ret_from_fork+0x27/0x50 [ 136.763753] -> #0 (console_lock){+.+.}-{0:0}: [ 136.763755] __lock_acquire+0x1241/0x23f0 [ 136.763756] lock_acquire+0xad/0x370 [ 136.763757] console_lock+0x47/0x70 [ 136.763761] drm_fb_helper_set_suspend_unlocked+0x7b/0xa0 [drm_kms_helper] [ 136.763809] amdgpu_device_gpu_recover.cold+0x21e/0xe7b [amdgpu] [ 136.763850] amdgpu_job_timedout+0xfb/0x150 [amdgpu] [ 136.763851] drm_sched_job_timedout+0x8a/0xf0 [gpu_sched] [ 136.763852] process_one_work+0x23c/0x580 [ 136.763853] worker_thread+0x50/0x3b0 [ 136.763854] kthread+0x12e/0x150 [ 136.763855] ret_from_fork+0x27/0x50 [ 136.763855] other info that might help us debug this:
[ 136.763856] Chain exists of: console_lock --> fs_reclaim --> dma_fence_map
[ 136.763857] Possible unsafe locking scenario:
[ 136.763857] CPU0 CPU1 [ 136.763857] ---- ---- [ 136.763857] lock(dma_fence_map); [ 136.763858] lock(fs_reclaim); [ 136.763858] lock(dma_fence_map); [ 136.763858] lock(console_lock); [ 136.763859] *** DEADLOCK ***
[ 136.763860] 4 locks held by kworker/2:3/682: [ 136.763860] #0: ffff8887fb81c938 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x1bc/0x580 [ 136.763862] #1: ffffc90000cafe58 ((work_completion)(&(&sched->work_tdr)->work)){+.+.}-{0:0}, at: process_one_work+0x1bc/0x580 [ 136.763863] #2: ffffffff82318c80 (dma_fence_map){++++}-{0:0}, at: drm_sched_job_timedout+0x25/0xf0 [gpu_sched] [ 136.763865] #3: ffff8887ab621748 (&adev->lock_reset){+.+.}-{3:3}, at: amdgpu_device_gpu_recover.cold+0x5ab/0xe7b [amdgpu] [ 136.763914] stack backtrace: [ 136.763915] CPU: 2 PID: 682 Comm: kworker/2:3 Tainted: G W 5.7.0-rc3+ #346 [ 136.763916] Hardware name: System manufacturer System Product Name/PRIME X370-PRO, BIOS 4011 04/19/2018 [ 136.763918] Workqueue: events drm_sched_job_timedout [gpu_sched] [ 136.763919] Call Trace: [ 136.763922] dump_stack+0x8f/0xd0 [ 136.763924] check_noncircular+0x162/0x180 [ 136.763926] __lock_acquire+0x1241/0x23f0 [ 136.763927] lock_acquire+0xad/0x370 [ 136.763932] ? drm_fb_helper_set_suspend_unlocked+0x7b/0xa0 [drm_kms_helper] [ 136.763933] ? mark_held_locks+0x2d/0x80 [ 136.763934] ? _raw_spin_unlock_irqrestore+0x46/0x60 [ 136.763936] console_lock+0x47/0x70 [ 136.763940] ? drm_fb_helper_set_suspend_unlocked+0x7b/0xa0 [drm_kms_helper] [ 136.763944] drm_fb_helper_set_suspend_unlocked+0x7b/0xa0 [drm_kms_helper] [ 136.763993] amdgpu_device_gpu_recover.cold+0x21e/0xe7b [amdgpu] [ 136.764036] amdgpu_job_timedout+0xfb/0x150 [amdgpu] [ 136.764038] drm_sched_job_timedout+0x8a/0xf0 [gpu_sched] [ 136.764040] process_one_work+0x23c/0x580 [ 136.764041] worker_thread+0x50/0x3b0 [ 136.764042] ? process_one_work+0x580/0x580 [ 136.764044] kthread+0x12e/0x150 [ 136.764045] ? kthread_create_worker_on_cpu+0x70/0x70 [ 136.764046] ret_from_fork+0x27/0x50
Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Cc: linux-rdma@vger.kernel.org Cc: amd-gfx@lists.freedesktop.org Cc: intel-gfx@lists.freedesktop.org Cc: Chris Wilson chris@chris-wilson.co.uk Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ---- 1 file changed, 4 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index ac0286a5f2fc..4c4492de670c 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -4063,8 +4063,6 @@ static int amdgpu_do_asic_reset(struct amdgpu_hive_info *hive, if (r) goto out;
- amdgpu_fbdev_set_suspend(tmp_adev, 0); - /* must succeed. */ amdgpu_ras_resume(tmp_adev);
@@ -4305,8 +4303,6 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev, */ amdgpu_unregister_gpu_instance(tmp_adev);
- amdgpu_fbdev_set_suspend(tmp_adev, 1); - /* disable ras on ALL IPs */ if (!(in_ras_intr && !use_baco) && amdgpu_device_ip_need_full_reset(tmp_adev))
...
I think it's time to stop this little exercise.
The lockdep splat, for the record:
[ 132.583381] ====================================================== [ 132.584091] WARNING: possible circular locking dependency detected [ 132.584775] 5.7.0-rc3+ #346 Tainted: G W [ 132.585461] ------------------------------------------------------ [ 132.586184] kworker/2:3/865 is trying to acquire lock: [ 132.586857] ffffc90000677c70 (crtc_ww_class_acquire){+.+.}-{0:0}, at: drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper] [ 132.587569] but task is already holding lock: [ 132.589044] ffffffff82318c80 (dma_fence_map){++++}-{0:0}, at: drm_sched_job_timedout+0x25/0xf0 [gpu_sched] [ 132.589803] which lock already depends on the new lock.
[ 132.592009] the existing dependency chain (in reverse order) is: [ 132.593507] -> #2 (dma_fence_map){++++}-{0:0}: [ 132.595019] dma_fence_begin_signalling+0x50/0x60 [ 132.595767] drm_atomic_helper_commit+0xa1/0x180 [drm_kms_helper] [ 132.596567] drm_client_modeset_commit_atomic+0x1ea/0x250 [drm] [ 132.597420] drm_client_modeset_commit_locked+0x55/0x190 [drm] [ 132.598178] drm_client_modeset_commit+0x24/0x40 [drm] [ 132.598948] drm_fb_helper_restore_fbdev_mode_unlocked+0x4b/0xa0 [drm_kms_helper] [ 132.599738] drm_fb_helper_set_par+0x30/0x40 [drm_kms_helper] [ 132.600539] fbcon_init+0x2e8/0x660 [ 132.601344] visual_init+0xce/0x130 [ 132.602156] do_bind_con_driver+0x1bc/0x2b0 [ 132.602970] do_take_over_console+0x115/0x180 [ 132.603763] do_fbcon_takeover+0x58/0xb0 [ 132.604564] register_framebuffer+0x1ee/0x300 [ 132.605369] __drm_fb_helper_initial_config_and_unlock+0x36e/0x520 [drm_kms_helper] [ 132.606187] amdgpu_fbdev_init+0xb3/0xf0 [amdgpu] [ 132.607032] amdgpu_device_init.cold+0xe90/0x1677 [amdgpu] [ 132.607862] amdgpu_driver_load_kms+0x5a/0x200 [amdgpu] [ 132.608697] amdgpu_pci_probe+0xf7/0x180 [amdgpu] [ 132.609511] local_pci_probe+0x42/0x80 [ 132.610324] pci_device_probe+0x104/0x1a0 [ 132.611130] really_probe+0x147/0x3c0 [ 132.611939] driver_probe_device+0xb6/0x100 [ 132.612766] device_driver_attach+0x53/0x60 [ 132.613593] __driver_attach+0x8c/0x150 [ 132.614419] bus_for_each_dev+0x7b/0xc0 [ 132.615249] bus_add_driver+0x14c/0x1f0 [ 132.616071] driver_register+0x6c/0xc0 [ 132.616902] do_one_initcall+0x5d/0x2f0 [ 132.617731] do_init_module+0x5c/0x230 [ 132.618560] load_module+0x2981/0x2bc0 [ 132.619391] __do_sys_finit_module+0xaa/0x110 [ 132.620228] do_syscall_64+0x5a/0x250 [ 132.621064] entry_SYSCALL_64_after_hwframe+0x49/0xb3 [ 132.621903] -> #1 (crtc_ww_class_mutex){+.+.}-{3:3}: [ 132.623587] __ww_mutex_lock.constprop.0+0xcc/0x10c0 [ 132.624448] ww_mutex_lock+0x43/0xb0 [ 132.625315] drm_modeset_lock+0x44/0x120 [drm] [ 132.626184] drmm_mode_config_init+0x2db/0x8b0 [drm] [ 132.627098] amdgpu_device_init.cold+0xbd1/0x1677 [amdgpu] [ 132.628007] amdgpu_driver_load_kms+0x5a/0x200 [amdgpu] [ 132.628920] amdgpu_pci_probe+0xf7/0x180 [amdgpu] [ 132.629804] local_pci_probe+0x42/0x80 [ 132.630690] pci_device_probe+0x104/0x1a0 [ 132.631583] really_probe+0x147/0x3c0 [ 132.632479] driver_probe_device+0xb6/0x100 [ 132.633379] device_driver_attach+0x53/0x60 [ 132.634275] __driver_attach+0x8c/0x150 [ 132.635170] bus_for_each_dev+0x7b/0xc0 [ 132.636069] bus_add_driver+0x14c/0x1f0 [ 132.636974] driver_register+0x6c/0xc0 [ 132.637870] do_one_initcall+0x5d/0x2f0 [ 132.638765] do_init_module+0x5c/0x230 [ 132.639654] load_module+0x2981/0x2bc0 [ 132.640522] __do_sys_finit_module+0xaa/0x110 [ 132.641372] do_syscall_64+0x5a/0x250 [ 132.642203] entry_SYSCALL_64_after_hwframe+0x49/0xb3 [ 132.643022] -> #0 (crtc_ww_class_acquire){+.+.}-{0:0}: [ 132.644643] __lock_acquire+0x1241/0x23f0 [ 132.645469] lock_acquire+0xad/0x370 [ 132.646274] drm_modeset_acquire_init+0xd2/0x100 [drm] [ 132.647071] drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper] [ 132.647902] dm_suspend+0x1c/0x60 [amdgpu] [ 132.648698] amdgpu_device_ip_suspend_phase1+0x83/0xe0 [amdgpu] [ 132.649498] amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu] [ 132.650300] amdgpu_device_gpu_recover.cold+0x4e6/0xe64 [amdgpu] [ 132.651084] amdgpu_job_timedout+0xfb/0x150 [amdgpu] [ 132.651825] drm_sched_job_timedout+0x8a/0xf0 [gpu_sched] [ 132.652594] process_one_work+0x23c/0x580 [ 132.653402] worker_thread+0x50/0x3b0 [ 132.654139] kthread+0x12e/0x150 [ 132.654868] 
ret_from_fork+0x27/0x50 [ 132.655598] other info that might help us debug this:
[ 132.657739] Chain exists of: crtc_ww_class_acquire --> crtc_ww_class_mutex --> dma_fence_map
[ 132.659877] Possible unsafe locking scenario:
[ 132.661416] CPU0 CPU1 [ 132.662126] ---- ---- [ 132.662847] lock(dma_fence_map); [ 132.663574] lock(crtc_ww_class_mutex); [ 132.664319] lock(dma_fence_map); [ 132.665063] lock(crtc_ww_class_acquire); [ 132.665799] *** DEADLOCK ***
[ 132.667965] 4 locks held by kworker/2:3/865: [ 132.668701] #0: ffff8887fb81c938 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x1bc/0x580 [ 132.669462] #1: ffffc90000677e58 ((work_completion)(&(&sched->work_tdr)->work)){+.+.}-{0:0}, at: process_one_work+0x1bc/0x580 [ 132.670242] #2: ffffffff82318c80 (dma_fence_map){++++}-{0:0}, at: drm_sched_job_timedout+0x25/0xf0 [gpu_sched] [ 132.671039] #3: ffff8887b84a1748 (&adev->lock_reset){+.+.}-{3:3}, at: amdgpu_device_gpu_recover.cold+0x59e/0xe64 [amdgpu] [ 132.671902] stack backtrace: [ 132.673515] CPU: 2 PID: 865 Comm: kworker/2:3 Tainted: G W 5.7.0-rc3+ #346 [ 132.674347] Hardware name: System manufacturer System Product Name/PRIME X370-PRO, BIOS 4011 04/19/2018 [ 132.675194] Workqueue: events drm_sched_job_timedout [gpu_sched] [ 132.676046] Call Trace: [ 132.676897] dump_stack+0x8f/0xd0 [ 132.677748] check_noncircular+0x162/0x180 [ 132.678604] ? stack_trace_save+0x4b/0x70 [ 132.679459] __lock_acquire+0x1241/0x23f0 [ 132.680311] lock_acquire+0xad/0x370 [ 132.681163] ? drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper] [ 132.682021] ? cpumask_next+0x16/0x20 [ 132.682880] ? module_assert_mutex_or_preempt+0x14/0x40 [ 132.683737] ? __module_address+0x28/0xf0 [ 132.684601] drm_modeset_acquire_init+0xd2/0x100 [drm] [ 132.685466] ? drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper] [ 132.686335] drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper] [ 132.687255] dm_suspend+0x1c/0x60 [amdgpu] [ 132.688152] amdgpu_device_ip_suspend_phase1+0x83/0xe0 [amdgpu] [ 132.689057] ? amdgpu_fence_process+0x4c/0x150 [amdgpu] [ 132.689963] amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu] [ 132.690893] amdgpu_device_gpu_recover.cold+0x4e6/0xe64 [amdgpu] [ 132.691818] amdgpu_job_timedout+0xfb/0x150 [amdgpu] [ 132.692707] drm_sched_job_timedout+0x8a/0xf0 [gpu_sched] [ 132.693597] process_one_work+0x23c/0x580 [ 132.694487] worker_thread+0x50/0x3b0 [ 132.695373] ? process_one_work+0x580/0x580 [ 132.696264] kthread+0x12e/0x150 [ 132.697154] ? kthread_create_worker_on_cpu+0x70/0x70 [ 132.698057] ret_from_fork+0x27/0x50
Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Cc: linux-rdma@vger.kernel.org Cc: amd-gfx@lists.freedesktop.org Cc: intel-gfx@lists.freedesktop.org Cc: Chris Wilson chris@chris-wilson.co.uk Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 8 ++++++++ 1 file changed, 8 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 4c4492de670c..3ea4b9258fb0 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -2441,6 +2441,14 @@ static int amdgpu_device_ip_suspend_phase1(struct amdgpu_device *adev) /* displays are handled separately */ if (adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_DCE) { /* XXX handle errors */ + + /* + * This is dm_suspend, which calls modeset locks, and + * that a pretty good inversion against dma_fence_signal + * which gpu recovery is supposed to guarantee. + * + * Dont ask me how to fix this. + */ r = adev->ip_blocks[i].version->funcs->suspend(adev); /* XXX handle errors */ if (r) {
i915 does tons of allocations from this worker, which lockdep catches.
Also, generic infrastructure like this, with big potential impact on how dma_fence or other cross-driver contracts work, really should be reviewed on dri-devel. Implementing custom wheels for everything within the driver is a classic case of the "platform problem" [1]. Which in upstream we really shouldn't have.
Since there's no quick way to solve these splats like there was for amdgpu (dma_fence_work is used a bunch in basic buffer management and command submission), I'm giving up at this point here. Annotating the i915 scheduler and gpu reset code would be interesting, but since lockdep is one-shot we can't see what surprises would lurk there.
1: https://lwn.net/Articles/443531/ Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Cc: linux-rdma@vger.kernel.org Cc: amd-gfx@lists.freedesktop.org Cc: intel-gfx@lists.freedesktop.org Cc: Chris Wilson chris@chris-wilson.co.uk Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com --- drivers/gpu/drm/i915/i915_sw_fence_work.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/drivers/gpu/drm/i915/i915_sw_fence_work.c b/drivers/gpu/drm/i915/i915_sw_fence_work.c index a3a81bb8f2c3..5b74acadaef5 100644 --- a/drivers/gpu/drm/i915/i915_sw_fence_work.c +++ b/drivers/gpu/drm/i915/i915_sw_fence_work.c @@ -17,12 +17,15 @@ static void fence_work(struct work_struct *work) { struct dma_fence_work *f = container_of(work, typeof(*f), work); int err; + bool fence_cookie;
+ fence_cookie = dma_fence_begin_signalling(); err = f->ops->work(f); if (err) dma_fence_set_error(&f->dma, err);
fence_complete(f); + dma_fence_end_signalling(fence_cookie); dma_fence_put(&f->dma); }