Hi all,
Here's the respin. Changes:
- The 2 patches for checking return values of callbacks are dropped, they landed already.
- Moved the lockdep annotations ahead, since I think that part is less contentious. The lockdep map now also annotates invalidate_range_end, as requested by Jason.
- Added a patch to prime lockdep, idea from Jason; let's hear whether the implementation fits.
- I've stuck with non_block_start/end for now and not switched back to preempt_disable/enable, but added comments as suggested by Andrew. Hopefully that fits the bill, otherwise I can go back again if the consensus is more in that direction.
Review, comments and ideas very much welcome.
Cheers, Daniel
Daniel Vetter (4):
  mm, notifier: Add a lockdep map for invalidate_range_start/end
  mm, notifier: Prime lockdep
  kernel.h: Add non_block_start/end()
  mm, notifier: Catch sleeping/blocking for !blockable
 include/linux/kernel.h       | 25 ++++++++++++++++++++++++-
 include/linux/mmu_notifier.h |  8 ++++++++
 include/linux/sched.h        |  4 ++++
 kernel/sched/core.c          | 19 ++++++++++++++-----
 mm/mmu_notifier.c            | 24 +++++++++++++++++++++++-
 5 files changed, 73 insertions(+), 7 deletions(-)
This is a similar idea to the fs_reclaim fake lockdep lock. It's fairly easy to provoke a specific notifier to be run on a specific range: Just prep it, and then munmap() it.
A bit harder, but still doable, is to provoke the mmu notifiers for all the various callchains that might lead to them. But both at the same time is really hard to reliably hit, especially when you want to exercise paths like direct reclaim or compaction, where it's not easy to control what exactly will be unmapped.
By introducing a lockdep map to tie them all together we allow lockdep to see a lot more dependencies, without having to actually hit them in a single callchain while testing.
On Jason's suggestion this is rolled out for both invalidate_range_start and invalidate_range_end. They both have the same calling context, hence we can share the same lockdep map. Note that the annotation for invalidate_range_start is outside of the mm_has_notifiers() check, to make sure lockdep is informed about all paths leading to this context irrespective of whether mmu notifiers are present for a given context. We don't do that on the invalidate_range_end side to avoid paying the overhead twice; there the lockdep annotation is pushed down behind the mm_has_notifiers() check.
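For illustration, a minimal sketch of the driver-side pattern this is meant to catch; the driver, its lock and its callback are made up, only the dependency chain matters:

/* Hypothetical driver code, for illustration only. */
static DEFINE_MUTEX(hypothetical_lock);

static int hypothetical_invalidate_range_start(struct mmu_notifier *mn,
				const struct mmu_notifier_range *range)
{
	/* lockdep records: notifier map -> hypothetical_lock */
	mutex_lock(&hypothetical_lock);
	/* ... tear down driver-private mappings in range ... */
	mutex_unlock(&hypothetical_lock);
	return 0;
}

static void hypothetical_alloc_path(void)
{
	void *p;

	mutex_lock(&hypothetical_lock);
	/*
	 * GFP_KERNEL allocations take the fs_reclaim map, so lockdep sees
	 * hypothetical_lock -> fs_reclaim here. Once reclaim has invoked the
	 * (now annotated) notifiers anywhere, fs_reclaim -> notifier map is
	 * known too, and together with the callback above lockdep can flag
	 * the cycle without all three paths ever nesting in a single run.
	 */
	p = kmalloc(64, GFP_KERNEL);
	kfree(p);
	mutex_unlock(&hypothetical_lock);
}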
v2: Use lock_map_acquire/release() like fs_reclaim, to avoid confusion with this being a real mutex (Chris Wilson).
v3: Rebase on top of Glisse's arg rework.
v4: Also annotate invalidate_range_end (Jason Gunthorpe). Also annotate invalidate_range_start_nonblock; I somehow missed that one in the first version.
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: linux-mm@kvack.org
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 include/linux/mmu_notifier.h | 8 ++++++++
 mm/mmu_notifier.c            | 9 +++++++++
 2 files changed, 17 insertions(+)
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index b6c004bd9f6a..39a86b77a939 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -42,6 +42,10 @@ enum mmu_notifier_event {

 #ifdef CONFIG_MMU_NOTIFIER

+#ifdef CONFIG_LOCKDEP
+extern struct lockdep_map __mmu_notifier_invalidate_range_start_map;
+#endif
+
 /*
  * The mmu notifier_mm structure is allocated and installed in
  * mm->mmu_notifier_mm inside the mm_take_all_locks() protected
@@ -310,19 +314,23 @@ static inline void mmu_notifier_change_pte(struct mm_struct *mm,
 static inline void
 mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
 {
+	lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
 	if (mm_has_notifiers(range->mm)) {
 		range->flags |= MMU_NOTIFIER_RANGE_BLOCKABLE;
 		__mmu_notifier_invalidate_range_start(range);
 	}
+	lock_map_release(&__mmu_notifier_invalidate_range_start_map);
 }

 static inline int
 mmu_notifier_invalidate_range_start_nonblock(struct mmu_notifier_range *range)
 {
+	lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
 	if (mm_has_notifiers(range->mm)) {
 		range->flags &= ~MMU_NOTIFIER_RANGE_BLOCKABLE;
 		return __mmu_notifier_invalidate_range_start(range);
 	}
+	lock_map_release(&__mmu_notifier_invalidate_range_start_map);
 	return 0;
 }

diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 16f1cbc775d0..d12e3079e7a4 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -21,6 +21,13 @@
 /* global SRCU for all MMs */
 DEFINE_STATIC_SRCU(srcu);

+#ifdef CONFIG_LOCKDEP
+struct lockdep_map __mmu_notifier_invalidate_range_start_map = {
+	.name = "mmu_notifier_invalidate_range_start"
+};
+EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start_map);
+#endif
+
 /*
  * This function allows mmu_notifier::release callback to delay a call to
  * a function that will free appropriate resources. The function must be
@@ -197,6 +204,7 @@ void __mmu_notifier_invalidate_range_end(struct mmu_notifier_range *range,
 	struct mmu_notifier *mn;
 	int id;

+	lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &range->mm->mmu_notifier_mm->list, hlist) {
 		/*
@@ -220,6 +228,7 @@ void __mmu_notifier_invalidate_range_end(struct mmu_notifier_range *range,
 			mn->ops->invalidate_range_end(mn, range);
 	}
 	srcu_read_unlock(&srcu, id);
+	lock_map_release(&__mmu_notifier_invalidate_range_start_map);
 }
 EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_end);
On Tue, Aug 20, 2019 at 10:18:59AM +0200, Daniel Vetter wrote:
This is a similar idea to the fs_reclaim fake lockdep lock. It's fairly easy to provoke a specific notifier to be run on a specific range: Just prep it, and then munmap() it.
A bit harder, but still doable, is to provoke the mmu notifiers for all the various callchains that might lead to them. But both at the same time is really hard to reliably hit, especially when you want to exercise paths like direct reclaim or compaction, where it's not easy to control what exactly will be unmapped.
By introducing a lockdep map to tie them all together we allow lockdep to see a lot more dependencies, without having to actually hit them in a single callchain while testing.
On Jason's suggestion this is rolled out for both invalidate_range_start and invalidate_range_end. They both have the same calling context, hence we can share the same lockdep map. Note that the annotation for invalidate_range_start is outside of the mm_has_notifiers() check, to make sure lockdep is informed about all paths leading to this context irrespective of whether mmu notifiers are present for a given context. We don't do that on the invalidate_range_end side to avoid paying the overhead twice; there the lockdep annotation is pushed down behind the mm_has_notifiers() check.
v2: Use lock_map_acquire/release() like fs_reclaim, to avoid confusion with this being a real mutex (Chris Wilson).
v3: Rebase on top of Glisse's arg rework.
v4: Also annotate invalidate_range_end (Jason Gunthorpe). Also annotate invalidate_range_start_nonblock; I somehow missed that one in the first version.
Cc: Jason Gunthorpe jgg@ziepe.ca Cc: Chris Wilson chris@chris-wilson.co.uk Cc: Andrew Morton akpm@linux-foundation.org Cc: David Rientjes rientjes@google.com Cc: "Jérôme Glisse" jglisse@redhat.com Cc: Michal Hocko mhocko@suse.com Cc: "Christian König" christian.koenig@amd.com Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: Daniel Vetter daniel.vetter@ffwll.ch Cc: Mike Rapoport rppt@linux.vnet.ibm.com Cc: linux-mm@kvack.org Signed-off-by: Daniel Vetter daniel.vetter@intel.com
include/linux/mmu_notifier.h | 8 ++++++++ mm/mmu_notifier.c | 9 +++++++++ 2 files changed, 17 insertions(+)
Reviewed-by: Jason Gunthorpe jgg@mellanox.com
Jason
We want to teach lockdep that mmu notifiers can be called from direct reclaim paths, since on many CI systems load might never reach that level (e.g. when just running fuzzer or small functional tests).
Motivated by a discussion with Jason.
I've put the annotation into mmu_notifier_register since only when we have mmu notifiers registered is there any point in teaching lockdep about them. Also, we already have a kmalloc(..., GFP_KERNEL) there, so this is safe.
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: linux-mm@kvack.org
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 mm/mmu_notifier.c | 7 +++++++
 1 file changed, 7 insertions(+)
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index d12e3079e7a4..538d3bb87f9b 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -256,6 +256,13 @@ static int do_mmu_notifier_register(struct mmu_notifier *mn,

 	BUG_ON(atomic_read(&mm->mm_users) <= 0);

+	if (IS_ENABLED(CONFIG_LOCKDEP)) {
+		fs_reclaim_acquire(GFP_KERNEL);
+		lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
+		lock_map_release(&__mmu_notifier_invalidate_range_start_map);
+		fs_reclaim_release(GFP_KERNEL);
+	}
+
 	ret = -ENOMEM;
 	mmu_notifier_mm = kmalloc(sizeof(struct mmu_notifier_mm), GFP_KERNEL);
 	if (unlikely(!mmu_notifier_mm))
On Tue, Aug 20, 2019 at 10:19:00AM +0200, Daniel Vetter wrote:
We want to teach lockdep that mmu notifiers can be called from direct reclaim paths, since on many CI systems load might never reach that level (e.g. when just running fuzzer or small functional tests).
Motivated by a discussion with Jason.
I've put the annotation into mmu_notifier_register since only when we have mmu notifiers registered is there any point in teaching lockdep about them. Also, we already have a kmalloc(..., GFP_KERNEL) there, so this is safe.
Cc: Jason Gunthorpe jgg@ziepe.ca Cc: Chris Wilson chris@chris-wilson.co.uk Cc: Andrew Morton akpm@linux-foundation.org Cc: David Rientjes rientjes@google.com Cc: "Jérôme Glisse" jglisse@redhat.com Cc: Michal Hocko mhocko@suse.com Cc: "Christian König" christian.koenig@amd.com Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: Daniel Vetter daniel.vetter@ffwll.ch Cc: Mike Rapoport rppt@linux.vnet.ibm.com Cc: linux-mm@kvack.org Signed-off-by: Daniel Vetter daniel.vetter@intel.com mm/mmu_notifier.c | 7 +++++++ 1 file changed, 7 insertions(+)
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index d12e3079e7a4..538d3bb87f9b 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -256,6 +256,13 @@ static int do_mmu_notifier_register(struct mmu_notifier *mn,

 	BUG_ON(atomic_read(&mm->mm_users) <= 0);

+	if (IS_ENABLED(CONFIG_LOCKDEP)) {
+		fs_reclaim_acquire(GFP_KERNEL);
+		lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
+		lock_map_release(&__mmu_notifier_invalidate_range_start_map);
+		fs_reclaim_release(GFP_KERNEL);
+	}
Let's try it out at least.
Reviewed-by: Jason Gunthorpe jgg@mellanox.com
Jason
In some special cases we must not block, but there's not a spinlock, preempt-off, irqs-off or similar critical section already that arms the might_sleep() debug checks. Add a non_block_start/end() pair to annotate these.
This will be used in the oom paths of mmu-notifiers, where blocking is not allowed to make sure there's forward progress. Quoting Michal:
"The notifier is called from quite a restricted context - oom_reaper - which shouldn't depend on any locks or sleepable conditionals. The code should be swift as well but we mostly do care about it to make a forward progress. Checking for sleepable context is the best thing we could come up with that would describe these demands at least partially."
Peter also asked whether we want to catch spinlocks on top, but Michal said those are less of a problem because spinlocks can't have an indirect dependency upon the page allocator and hence close the loop with the oom reaper.
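As a minimal usage sketch (this mirrors what the last patch in this series does around the notifier callbacks; the wrapper function name here is made up):

static int call_invalidate_start(struct mmu_notifier *mn,
				 const struct mmu_notifier_range *range)
{
	int ret;

	if (!mmu_notifier_range_blockable(range))
		non_block_start();
	/* any schedule() in the callback now produces a debug splat */
	ret = mn->ops->invalidate_range_start(mn, range);
	if (!mmu_notifier_range_blockable(range))
		non_block_end();

	return ret;
}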
Suggested by Michal Hocko.
v2:
- Improve commit message (Michal)
- Also check in schedule, not just might_sleep (Peter)
v3: It works better when I actually squash in the fixup I had lying around :-/
v4: Pick the suggestion from Andrew Morton to give non_block_start/end some good kerneldoc comments. I added that other blocking calls like wait_event pose similar issues, since that's the other example we discussed.
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: linux-mm@kvack.org
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Wei Wang <wvw@google.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jann Horn <jannh@google.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: linux-kernel@vger.kernel.org
Acked-by: Christian König <christian.koenig@amd.com> (v1)
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 include/linux/kernel.h | 25 ++++++++++++++++++++++++-
 include/linux/sched.h  |  4 ++++
 kernel/sched/core.c    | 19 ++++++++++++++-----
 3 files changed, 42 insertions(+), 6 deletions(-)
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 4fa360a13c1e..82f84cfe372f 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -217,7 +217,9 @@ extern void __cant_sleep(const char *file, int line, int preempt_offset);
  * might_sleep - annotation for functions that can sleep
  *
  * this macro will print a stack trace if it is executed in an atomic
- * context (spinlock, irq-handler, ...).
+ * context (spinlock, irq-handler, ...). Additional sections where blocking is
+ * not allowed can be annotated with non_block_start() and non_block_end()
+ * pairs.
  *
  * This is a useful debugging help to be able to catch problems early and not
  * be bitten later when the calling function happens to sleep when it is not
@@ -233,6 +235,25 @@ extern void __cant_sleep(const char *file, int line, int preempt_offset);
 # define cant_sleep() \
 	do { __cant_sleep(__FILE__, __LINE__, 0); } while (0)
 # define sched_annotate_sleep()	(current->task_state_change = 0)
+/**
+ * non_block_start - annotate the start of section where sleeping is prohibited
+ *
+ * This is on behalf of the oom reaper, specifically when it is calling the mmu
+ * notifiers. The problem is that if the notifier were to block on, for example,
+ * mutex_lock() and if the process which holds that mutex were to perform a
+ * sleeping memory allocation, the oom reaper is now blocked on completion of
+ * that memory allocation. Other blocking calls like wait_event() pose similar
+ * issues.
+ */
+# define non_block_start() \
+	do { current->non_block_count++; } while (0)
+/**
+ * non_block_end - annotate the end of section where sleeping is prohibited
+ *
+ * Closes a section opened by non_block_start().
+ */
+# define non_block_end() \
+	do { WARN_ON(current->non_block_count-- == 0); } while (0)
 #else
   static inline void ___might_sleep(const char *file, int line,
 				   int preempt_offset) { }
@@ -241,6 +262,8 @@ extern void __cant_sleep(const char *file, int line, int preempt_offset);
 # define might_sleep() do { might_resched(); } while (0)
 # define cant_sleep() do { } while (0)
 # define sched_annotate_sleep() do { } while (0)
+# define non_block_start() do { } while (0)
+# define non_block_end() do { } while (0)
 #endif

 #define might_sleep_if(cond) do { if (cond) might_sleep(); } while (0)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9f51932bd543..c5630f3dca1f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -974,6 +974,10 @@ struct task_struct {
 	struct mutex_waiter		*blocked_on;
 #endif

+#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
+	int				non_block_count;
+#endif
+
 #ifdef CONFIG_TRACE_IRQFLAGS
 	unsigned int			irq_events;
 	unsigned long			hardirq_enable_ip;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2b037f195473..57245770d6cc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3700,13 +3700,22 @@ static noinline void __schedule_bug(struct task_struct *prev)
 /*
  * Various schedule()-time debugging checks and statistics:
  */
-static inline void schedule_debug(struct task_struct *prev)
+static inline void schedule_debug(struct task_struct *prev, bool preempt)
 {
 #ifdef CONFIG_SCHED_STACK_END_CHECK
 	if (task_stack_end_corrupted(prev))
 		panic("corrupted stack end detected inside scheduler\n");
 #endif

+#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
+	if (!preempt && prev->state && prev->non_block_count) {
+		printk(KERN_ERR "BUG: scheduling in a non-blocking section: %s/%d/%i\n",
+			prev->comm, prev->pid, prev->non_block_count);
+		dump_stack();
+		add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
+	}
+#endif
+
 	if (unlikely(in_atomic_preempt_off())) {
 		__schedule_bug(prev);
 		preempt_count_set(PREEMPT_DISABLED);
@@ -3813,7 +3822,7 @@ static void __sched notrace __schedule(bool preempt)
 	rq = cpu_rq(cpu);
 	prev = rq->curr;

-	schedule_debug(prev);
+	schedule_debug(prev, preempt);

 	if (sched_feat(HRTICK))
 		hrtick_clear(rq);
@@ -6570,7 +6579,7 @@ void ___might_sleep(const char *file, int line, int preempt_offset)
 	rcu_sleep_check();

 	if ((preempt_count_equals(preempt_offset) && !irqs_disabled() &&
-	     !is_idle_task(current)) ||
+	     !is_idle_task(current) && !current->non_block_count) ||
 	    system_state == SYSTEM_BOOTING || system_state > SYSTEM_RUNNING ||
 	    oops_in_progress)
 		return;
@@ -6586,8 +6595,8 @@ void ___might_sleep(const char *file, int line, int preempt_offset)
 		"BUG: sleeping function called from invalid context at %s:%d\n",
 			file, line);
 	printk(KERN_ERR
-		"in_atomic(): %d, irqs_disabled(): %d, pid: %d, name: %s\n",
-			in_atomic(), irqs_disabled(),
+		"in_atomic(): %d, irqs_disabled(): %d, non_block: %d, pid: %d, name: %s\n",
+			in_atomic(), irqs_disabled(), current->non_block_count,
 			current->pid, current->comm);

 	if (task_stack_end_corrupted(current))
On Tue, Aug 20, 2019 at 10:19:01AM +0200, Daniel Vetter wrote:
In some special cases we must not block, but there's not a spinlock, preempt-off, irqs-off or similar critical section already that arms the might_sleep() debug checks. Add a non_block_start/end() pair to annotate these.
This will be used in the oom paths of mmu-notifiers, where blocking is not allowed to make sure there's forward progress. Quoting Michal:
"The notifier is called from quite a restricted context - oom_reaper - which shouldn't depend on any locks or sleepable conditionals. The code should be swift as well but we mostly do care about it to make a forward progress. Checking for sleepable context is the best thing we could come up with that would describe these demands at least partially."
Peter also asked whether we want to catch spinlocks on top, but Michal said those are less of a problem because spinlocks can't have an indirect dependency upon the page allocator and hence close the loop with the oom reaper.
Suggested by Michal Hocko.
v2:
- Improve commit message (Michal)
- Also check in schedule, not just might_sleep (Peter)
v3: It works better when I actually squash in the fixup I had lying around :-/
v4: Pick the suggestion from Andrew Morton to give non_block_start/end some good kerneldoc comments. I added that other blocking calls like wait_event pose similar issues, since that's the other example we discussed.
Cc: Jason Gunthorpe jgg@ziepe.ca Cc: Peter Zijlstra peterz@infradead.org Cc: Ingo Molnar mingo@redhat.com Cc: Andrew Morton akpm@linux-foundation.org Cc: Michal Hocko mhocko@suse.com Cc: David Rientjes rientjes@google.com Cc: "Christian König" christian.koenig@amd.com Cc: Daniel Vetter daniel.vetter@ffwll.ch Cc: "Jérôme Glisse" jglisse@redhat.com Cc: linux-mm@kvack.org Cc: Masahiro Yamada yamada.masahiro@socionext.com Cc: Wei Wang wvw@google.com Cc: Andy Shevchenko andriy.shevchenko@linux.intel.com Cc: Thomas Gleixner tglx@linutronix.de Cc: Jann Horn jannh@google.com Cc: Feng Tang feng.tang@intel.com Cc: Kees Cook keescook@chromium.org Cc: Randy Dunlap rdunlap@infradead.org Cc: linux-kernel@vger.kernel.org Acked-by: Christian König christian.koenig@amd.com (v1) Signed-off-by: Daniel Vetter daniel.vetter@intel.com
Hi Peter,
Iirc you've been involved at least somewhat in discussing this. -mm folks are a bit undecided whether these new non_block semantics are a good idea. Michal Hocko still is in support, but Andrew Morton and Jason Gunthorpe are less enthusiastic. Jason said he's ok with merging the hmm side of this if scheduler folks ack. If not, then I'll respin with the preempt_disable/enable instead like in v1.
So ack/nack for this from the scheduler side?
Thanks, Daniel
On Tue, 20 Aug 2019 22:24:40 +0200 Daniel Vetter daniel@ffwll.ch wrote:
Hi Peter,
Iirc you've been involved at least somewhat in discussing this. -mm folks are a bit undecided whether these new non_block semantics are a good idea. Michal Hocko still is in support, but Andrew Morton and Jason Gunthorpe are less enthusiastic. Jason said he's ok with merging the hmm side of this if scheduler folks ack. If not, then I'll respin with the preempt_disable/enable instead like in v1.
I became mollified once Michal explained the rationale. I think it's OK. It's very specific to the oom reaper and hopefully won't be used more widely(?).
On Fri, Aug 23, 2019 at 1:14 AM Andrew Morton akpm@linux-foundation.org wrote:
On Tue, 20 Aug 2019 22:24:40 +0200 Daniel Vetter daniel@ffwll.ch wrote:
Hi Peter,
Iirc you've been involved at least somewhat in discussing this. -mm folks are a bit undecided whether these new non_block semantics are a good idea. Michal Hocko still is in support, but Andrew Morton and Jason Gunthorpe are less enthusiastic. Jason said he's ok with merging the hmm side of this if scheduler folks ack. If not, then I'll respin with the preempt_disable/enable instead like in v1.
I became mollified once Michal explained the rationale. I think it's OK. It's very specific to the oom reaper and hopefully won't be used more widely(?).
Yeah, no plans for that from me. And I hope the comment above them now explains why they exist, so people think twice before using it in random places. -Daniel
On Tue, Aug 20, 2019 at 10:24:40PM +0200, Daniel Vetter wrote:
On Tue, Aug 20, 2019 at 10:19:01AM +0200, Daniel Vetter wrote:
In some special cases we must not block, but there's not a spinlock, preempt-off, irqs-off or similar critical section already that arms the might_sleep() debug checks. Add a non_block_start/end() pair to annotate these.
This will be used in the oom paths of mmu-notifiers, where blocking is not allowed to make sure there's forward progress. Quoting Michal:
"The notifier is called from quite a restricted context - oom_reaper - which shouldn't depend on any locks or sleepable conditionals. The code should be swift as well but we mostly do care about it to make a forward progress. Checking for sleepable context is the best thing we could come up with that would describe these demands at least partially."
Peter also asked whether we want to catch spinlocks on top, but Michal said those are less of a problem because spinlocks can't have an indirect dependency upon the page allocator and hence close the loop with the oom reaper.
Suggested by Michal Hocko.
v2:
- Improve commit message (Michal)
- Also check in schedule, not just might_sleep (Peter)
v3: It works better when I actually squash in the fixup I had lying around :-/
v4: Pick the suggestion from Andrew Morton to give non_block_start/end some good kerneldoc comments. I added that other blocking calls like wait_event pose similar issues, since that's the other example we discussed.
Cc: Jason Gunthorpe jgg@ziepe.ca Cc: Peter Zijlstra peterz@infradead.org Cc: Ingo Molnar mingo@redhat.com Cc: Andrew Morton akpm@linux-foundation.org Cc: Michal Hocko mhocko@suse.com Cc: David Rientjes rientjes@google.com Cc: "Christian König" christian.koenig@amd.com Cc: Daniel Vetter daniel.vetter@ffwll.ch Cc: "Jérôme Glisse" jglisse@redhat.com Cc: linux-mm@kvack.org Cc: Masahiro Yamada yamada.masahiro@socionext.com Cc: Wei Wang wvw@google.com Cc: Andy Shevchenko andriy.shevchenko@linux.intel.com Cc: Thomas Gleixner tglx@linutronix.de Cc: Jann Horn jannh@google.com Cc: Feng Tang feng.tang@intel.com Cc: Kees Cook keescook@chromium.org Cc: Randy Dunlap rdunlap@infradead.org Cc: linux-kernel@vger.kernel.org Acked-by: Christian König christian.koenig@amd.com (v1) Signed-off-by: Daniel Vetter daniel.vetter@intel.com
Hi Peter,
Iirc you've been involved at least somewhat in discussing this. -mm folks are a bit undecided whether these new non_block semantics are a good idea. Michal Hocko still is in support, but Andrew Morton and Jason Gunthorpe are less enthusiastic. Jason said he's ok with merging the hmm side of this if scheduler folks ack. If not, then I'll respin with the preempt_disable/enable instead like in v1.
So ack/nack for this from the scheduler side?
Right, I had memories of seeing this before, and I just found a fairly long discussion on this elsewhere in the vacation inbox (*groan*).
Yeah, this is something I can live with,
Acked-by: Peter Zijlstra (Intel) peterz@infradead.org
We need to make sure implementations don't cheat and don't have a possible schedule/blocking point deeply buried where review can't catch it.
I'm not sure whether this is the best way to make sure all the might_sleep() callsites trigger, and it's a bit ugly in the code flow. But it gets the job done.
Inspired by an i915 patch series which did exactly that, because the rules haven't been entirely clear to us.
v2: Use the shiny new non_block_start/end annotations instead of abusing preempt_disable/enable.
v3: Rebase on top of Glisse's arg rework.
v4: Rebase on top of more Glisse rework.
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: linux-mm@kvack.org
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 mm/mmu_notifier.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 538d3bb87f9b..856636d06ee0 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -181,7 +181,13 @@ int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &range->mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_range_start) {
-			int _ret = mn->ops->invalidate_range_start(mn, range);
+			int _ret;
+
+			if (!mmu_notifier_range_blockable(range))
+				non_block_start();
+			_ret = mn->ops->invalidate_range_start(mn, range);
+			if (!mmu_notifier_range_blockable(range))
+				non_block_end();
 			if (_ret) {
 				pr_info("%pS callback failed with %d in %sblockable context.\n",
 					mn->ops->invalidate_range_start, _ret,
On Tue, Aug 20, 2019 at 10:19:02AM +0200, Daniel Vetter wrote:
We need to make sure implementations don't cheat and don't have a possible schedule/blocking point deeply buried where review can't catch it.
I'm not sure whether this is the best way to make sure all the might_sleep() callsites trigger, and it's a bit ugly in the code flow. But it gets the job done.
Inspired by an i915 patch series which did exactly that, because the rules haven't been entirely clear to us.
v2: Use the shiny new non_block_start/end annotations instead of abusing preempt_disable/enable.
v3: Rebase on top of Glisse's arg rework.
v4: Rebase on top of more Glisse rework.
Cc: Jason Gunthorpe jgg@ziepe.ca Cc: Andrew Morton akpm@linux-foundation.org Cc: Michal Hocko mhocko@suse.com Cc: David Rientjes rientjes@google.com Cc: "Christian König" christian.koenig@amd.com Cc: Daniel Vetter daniel.vetter@ffwll.ch Cc: "Jérôme Glisse" jglisse@redhat.com Cc: linux-mm@kvack.org Reviewed-by: Christian König christian.koenig@amd.com Reviewed-by: Jérôme Glisse jglisse@redhat.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com mm/mmu_notifier.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 538d3bb87f9b..856636d06ee0 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -181,7 +181,13 @@ int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &range->mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_range_start) {
-			int _ret = mn->ops->invalidate_range_start(mn, range);
+			int _ret;
+
+			if (!mmu_notifier_range_blockable(range))
+				non_block_start();
+			_ret = mn->ops->invalidate_range_start(mn, range);
+			if (!mmu_notifier_range_blockable(range))
+				non_block_end();
If someone Acks all the sched changes then I can pick this for hmm.git, but I still think the existing pre-emption debugging is fine for this use case.
Also, same comment as for the lockdep map, this needs to apply to the non-blocking range_end also.
Anyhow, since this series has conflicts with hmm.git it would be best to flow through the whole thing through that tree. If there are no remarks on the first two patches I'll grab them in a few days.
Regards, Jason
On Tue, Aug 20, 2019 at 10:34:18AM -0300, Jason Gunthorpe wrote:
On Tue, Aug 20, 2019 at 10:19:02AM +0200, Daniel Vetter wrote:
We need to make sure implementations don't cheat and don't have a possible schedule/blocking point deeply buried where review can't catch it.
I'm not sure whether this is the best way to make sure all the might_sleep() callsites trigger, and it's a bit ugly in the code flow. But it gets the job done.
Inspired by an i915 patch series which did exactly that, because the rules haven't been entirely clear to us.
v2: Use the shiny new non_block_start/end annotations instead of abusing preempt_disable/enable.
v3: Rebase on top of Glisse's arg rework.
v4: Rebase on top of more Glisse rework.
Cc: Jason Gunthorpe jgg@ziepe.ca Cc: Andrew Morton akpm@linux-foundation.org Cc: Michal Hocko mhocko@suse.com Cc: David Rientjes rientjes@google.com Cc: "Christian König" christian.koenig@amd.com Cc: Daniel Vetter daniel.vetter@ffwll.ch Cc: "Jérôme Glisse" jglisse@redhat.com Cc: linux-mm@kvack.org Reviewed-by: Christian König christian.koenig@amd.com Reviewed-by: Jérôme Glisse jglisse@redhat.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com mm/mmu_notifier.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 538d3bb87f9b..856636d06ee0 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -181,7 +181,13 @@ int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &range->mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_range_start) {
-			int _ret = mn->ops->invalidate_range_start(mn, range);
+			int _ret;
+
+			if (!mmu_notifier_range_blockable(range))
+				non_block_start();
+			_ret = mn->ops->invalidate_range_start(mn, range);
+			if (!mmu_notifier_range_blockable(range))
+				non_block_end();
If someone Acks all the sched changes then I can pick this for hmm.git, but I still think the existing pre-emption debugging is fine for this use case.
Ok, I'll ping Peter Z. for an ack, iirc he was involved.
Also, same comment as for the lockdep map, this needs to apply to the non-blocking range_end also.
Hm, I thought the page table locks we're holding there already prevent any sleeping, so would be redundant? But reading through code I think that's not guaranteed, so yeah makes sense to add it for invalidate_range_end too. I'll respin once I have the ack/nack from scheduler people.
Anyhow, since this series has conflicts with hmm.git it would be best to flow through the whole thing through that tree. If there are no remarks on the first two patches I'll grab them in a few days.
Thanks, Daniel
On Tue, Aug 20, 2019 at 05:18:10PM +0200, Daniel Vetter wrote:
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 538d3bb87f9b..856636d06ee0 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -181,7 +181,13 @@ int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &range->mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_range_start) {
-			int _ret = mn->ops->invalidate_range_start(mn, range);
+			int _ret;
+
+			if (!mmu_notifier_range_blockable(range))
+				non_block_start();
+			_ret = mn->ops->invalidate_range_start(mn, range);
+			if (!mmu_notifier_range_blockable(range))
+				non_block_end();
If someone Acks all the sched changes then I can pick this for hmm.git, but I still think the existing pre-emption debugging is fine for this use case.
Ok, I'll ping Peter Z. for an ack, iirc he was involved.
Also, same comment as for the lockdep map, this needs to apply to the non-blocking range_end also.
Hm, I thought the page table locks we're holding there already prevent any sleeping, so would be redundant?
AFAIK no. All callers of invalidate_range_start/end pairs do so a few lines apart and don't change their locking in between - thus since start can block so can end.
Would love to know if that is not true??
Similarly I've also been idly wondering if we should add a might_sleep() to invalidate_range_start/end() to make this constraint clear & tested on the mm side?
Jason
On Wed, Aug 21, 2019 at 9:33 AM Jason Gunthorpe jgg@ziepe.ca wrote:
On Tue, Aug 20, 2019 at 05:18:10PM +0200, Daniel Vetter wrote:
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 538d3bb87f9b..856636d06ee0 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -181,7 +181,13 @@ int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &range->mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_range_start) {
-			int _ret = mn->ops->invalidate_range_start(mn, range);
+			int _ret;
+
+			if (!mmu_notifier_range_blockable(range))
+				non_block_start();
+			_ret = mn->ops->invalidate_range_start(mn, range);
+			if (!mmu_notifier_range_blockable(range))
+				non_block_end();
If someone Acks all the sched changes then I can pick this for hmm.git, but I still think the existing pre-emption debugging is fine for this use case.
Ok, I'll ping Peter Z. for an ack, iirc he was involved.
Also, same comment as for the lockdep map, this needs to apply to the non-blocking range_end also.
Hm, I thought the page table locks we're holding there already prevent any sleeping, so would be redundant?
AFAIK no. All callers of invalidate_range_start/end pairs do so a few lines apart and don't change their locking in between - thus since start can block so can end.
Would love to know if that is not true??
Yeah I reviewed them, I think I mixed up a discussion I had a while ago with Jerome. It's a bit tricky to follow in the code since in some places ->invalidate_range and ->invalidate_range_end seem to be called from the same place, in others not at all.
Similarly I've also been idly wondering if we should add a might_sleep() to invalidate_range_start/end() to make this constraint clear & tested on the mm side?
Hm, sounds like a useful idea. In general you won't test with mmu notifiers, but they could kick in, and then they'll usually block on at least some mutex. I'll throw that in as an idea on top for the next round. -Daniel
On Tue, Aug 20, 2019 at 05:18:10PM +0200, Daniel Vetter wrote:
On Tue, Aug 20, 2019 at 10:34:18AM -0300, Jason Gunthorpe wrote:
On Tue, Aug 20, 2019 at 10:19:02AM +0200, Daniel Vetter wrote:
We need to make sure implementations don't cheat and don't have a possible schedule/blocking point deeply buried where review can't catch it.
I'm not sure whether this is the best way to make sure all the might_sleep() callsites trigger, and it's a bit ugly in the code flow. But it gets the job done.
Inspired by an i915 patch series which did exactly that, because the rules haven't been entirely clear to us.
v2: Use the shiny new non_block_start/end annotations instead of abusing preempt_disable/enable.
v3: Rebase on top of Glisse's arg rework.
v4: Rebase on top of more Glisse rework.
Cc: Jason Gunthorpe jgg@ziepe.ca Cc: Andrew Morton akpm@linux-foundation.org Cc: Michal Hocko mhocko@suse.com Cc: David Rientjes rientjes@google.com Cc: "Christian König" christian.koenig@amd.com Cc: Daniel Vetter daniel.vetter@ffwll.ch Cc: "Jérôme Glisse" jglisse@redhat.com Cc: linux-mm@kvack.org Reviewed-by: Christian König christian.koenig@amd.com Reviewed-by: Jérôme Glisse jglisse@redhat.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com mm/mmu_notifier.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 538d3bb87f9b..856636d06ee0 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -181,7 +181,13 @@ int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &range->mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_range_start) {
-			int _ret = mn->ops->invalidate_range_start(mn, range);
+			int _ret;
+
+			if (!mmu_notifier_range_blockable(range))
+				non_block_start();
+			_ret = mn->ops->invalidate_range_start(mn, range);
+			if (!mmu_notifier_range_blockable(range))
+				non_block_end();
If someone Acks all the sched changes then I can pick this for hmm.git, but I still think the existing pre-emption debugging is fine for this use case.
Ok, I'll ping Peter Z. for an ack, iirc he was involved.
Also, same comment as for the lockdep map, this needs to apply to the non-blocking range_end also.
Hm, I thought the page table locks we're holding there already prevent any sleeping, so would be redundant? But reading through code I think that's not guaranteed, so yeah makes sense to add it for invalidate_range_end too. I'll respin once I have the ack/nack from scheduler people.
So I started to look into this, and I'm a bit confused. There's no _nonblock version of this, so does this mean blocking is never allowed, or always allowed?
From a quick look through implementations I've only seen spinlocks, and one up_read. So I guess I should wrap this callback in some unconditional non_block_start/end, but I'm not sure.
Thanks, Daniel
Anyhow, since this series has conflicts with hmm.git it would be best to flow through the whole thing through that tree. If there are no remarks on the first two patches I'll grab them in a few days.
Thanks, Daniel
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
On Wed, Aug 21, 2019 at 05:41:51PM +0200, Daniel Vetter wrote:
Hm, I thought the page table locks we're holding there already prevent any sleeping, so would be redundant? But reading through code I think that's not guaranteed, so yeah makes sense to add it for invalidate_range_end too. I'll respin once I have the ack/nack from scheduler people.
So I started to look into this, and I'm a bit confused. There's no _nonblock version of this, so does this mean blocking is never allowed, or always allowed?
RDMA has a mutex:
 ib_umem_notifier_invalidate_range_end
   rbt_ib_umem_for_each_in_range
     invalidate_range_start_trampoline
       ib_umem_notifier_end_account
         mutex_lock(&umem_odp->umem_mutex);
I'm working to delete this path though!
nonblocking or not follows the start, the same flag gets placed into the mmu_notifier_range struct passed to end.
From a quick look through implementations I've only seen spinlocks, and one up_read. So I guess I should wrap this callback in some unconditional non_block_start/end, but I'm not sure.
For now, we should keep it the same as start, conditionally blocking.
Hopefully before LPC I can send a RFC series that eliminates most invalidate_range_end users in favor of common locking..
Jason
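For concreteness, a sketch of what keeping the end side "the same as start, conditionally blocking" could look like inside __mmu_notifier_invalidate_range_end(), reusing the conditional annotation from this series (not part of the posted patches):

	hlist_for_each_entry_rcu(mn, &range->mm->mmu_notifier_mm->list, hlist) {
		if (mn->ops->invalidate_range_end) {
			if (!mmu_notifier_range_blockable(range))
				non_block_start();
			mn->ops->invalidate_range_end(mn, range);
			if (!mmu_notifier_range_blockable(range))
				non_block_end();
		}
	}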
On Thu, Aug 22, 2019 at 10:16 AM Jason Gunthorpe jgg@ziepe.ca wrote:
On Wed, Aug 21, 2019 at 05:41:51PM +0200, Daniel Vetter wrote:
Hm, I thought the page table locks we're holding there already prevent any sleeping, so would be redundant? But reading through code I think that's not guaranteed, so yeah makes sense to add it for invalidate_range_end too. I'll respin once I have the ack/nack from scheduler people.
So I started to look into this, and I'm a bit confused. There's no _nonblock version of this, so does this mean blocking is never allowed, or always allowed?
RDMA has a mutex:
 ib_umem_notifier_invalidate_range_end
   rbt_ib_umem_for_each_in_range
     invalidate_range_start_trampoline
       ib_umem_notifier_end_account
         mutex_lock(&umem_odp->umem_mutex);
I'm working to delete this path though!
nonblocking or not follows the start, the same flag gets placed into the mmu_notifier_range struct passed to end.
Ok, makes sense.
I guess that also means the might_sleep (I started on that) in invalidate_range_end also needs to be conditional? Or not bother with a might_sleep in invalidate_range_end since you're working on removing the last sleep in there?
From a quick look through implementations I've only seen spinlocks, and one up_read. So I guess I should wrape this callback in some unconditional non_block_start/end, but I'm not sure.
For now, we should keep it the same as start, conditionally blocking.
Hopefully before LPC I can send a RFC series that eliminates most invalidate_range_end users in favor of common locking..
Thanks, Daniel
On Thu, Aug 22, 2019 at 10:42:39AM +0200, Daniel Vetter wrote:
RDMA has a mutex:
 ib_umem_notifier_invalidate_range_end
   rbt_ib_umem_for_each_in_range
     invalidate_range_start_trampoline
       ib_umem_notifier_end_account
         mutex_lock(&umem_odp->umem_mutex);
I'm working to delete this path though!
nonblocking or not follows the start, the same flag gets placed into the mmu_notifier_range struct passed to end.
Ok, makes sense.
I guess that also means the might_sleep (I started on that) in invalidate_range_end also needs to be conditional? Or not bother with a might_sleep in invalidate_range_end since you're working on removing the last sleep in there?
I might suggest the same pattern as used for locked, the might_sleep() unconditionally on the start, and a 2nd might_sleep() after the IF in __mmu_notifier_invalidate_range_end()
Observing that by audit all the callers already have the same locking context for start/end
Jason
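One possible reading of that suggestion, as a rough sketch on top of this series (exact placement of the second check is open):

/* unconditional check in the blockable wrapper: */
static inline void
mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
{
	might_sleep();
	lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
	if (mm_has_notifiers(range->mm)) {
		range->flags |= MMU_NOTIFIER_RANGE_BLOCKABLE;
		__mmu_notifier_invalidate_range_start(range);
	}
	lock_map_release(&__mmu_notifier_invalidate_range_start_map);
}

/* and in __mmu_notifier_invalidate_range_end(), conditional on the flag
 * that the end side inherits from start:
 */
	if (mmu_notifier_range_blockable(range))
		might_sleep();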
On Thu, Aug 22, 2019 at 4:24 PM Jason Gunthorpe jgg@ziepe.ca wrote:
On Thu, Aug 22, 2019 at 10:42:39AM +0200, Daniel Vetter wrote:
RDMA has a mutex:
 ib_umem_notifier_invalidate_range_end
   rbt_ib_umem_for_each_in_range
     invalidate_range_start_trampoline
       ib_umem_notifier_end_account
         mutex_lock(&umem_odp->umem_mutex);
I'm working to delete this path though!
nonblocking or not follows the start, the same flag gets placed into the mmu_notifier_range struct passed to end.
Ok, makes sense.
I guess that also means the might_sleep (I started on that) in invalidate_range_end also needs to be conditional? Or not bother with a might_sleep in invalidate_range_end since you're working on removing the last sleep in there?
I might suggest the same pattern as used for locked, the might_sleep() unconditionally on the start, and a 2nd might_sleep() after the IF in __mmu_notifier_invalidate_range_end()
Observing that by audit all the callers already have the same locking context for start/end
My question was more about enforcing that going forward, since you're working to remove all the sleeps from invalidate_range_end. I don't want to add debug annotations which are stricter than what the other side actually expects. But since currently there are still sleeping locks in invalidate_range_end I think I'll just stick them in both places. You can then (re)move it when the cleanup lands. -Daniel